PRECIS

RESEARCH  PROGRESS  REPORT 


Title: "Annual Report: Automatic Indexing and Abstracting," Annual Progress
Report, Part 1, Office of Naval Research, Contract Nonr 4440(00).

Background: This investigation is concerned with the development of automatic
indexing, abstracting, and extracting systems. Basic investigations in English
morphology, phonetics, and syntax are pursued as necessary means to this end.

Condensed  Report  Contents:  The  second  annual  report  on  automatic  indexing  and 
extracting  consists  of  8  papers  summarizing  progress  in  three  areas  of  investigation: 

(1)  Application  of  English  word  morphology  to  automatic  indexing  and  extracting 

(2)  Use  of  combined  syntactic  and  entropy  selection  criteria  in  automatic  indexing 

(3)  Studies  in  phonetic  English 

The first four papers are concerned with the relationship between the part of speech
of words and their graphic form. An operational definition of affixes is given, the
usefulness of affixes in the automatic determination of parts of speech is discussed, and an
algorithm is outlined for determining parts of speech with a dictionary look-up of fewer
than 200 affixes and fewer than 800 words. The inflection of adjectives is also discussed,
anticipating the need for future refinement of the part-of-speech algorithm, which at
present identifies 11 part-of-speech categories. For some objectives these categories
may be inadequate, necessitating further breakdown; for example, adjectives might be
further distinguished as relative, comparative, etc.

The fifth paper is a progress report on the development of a method for automatic
indexing without reference to any pre-prepared dictionary, thesaurus, etc. It shows
the current results on five text excerpts.

The final three papers are concerned with the relationship between English phonetics
and English morphology. One of the papers is concerned with homonyms, which represent
a problem area in transformation from phonetic to graphic English. Another discusses
a function for mapping written English into spoken English, and the third describes
a computerized study of transcribed English phonetics as given by different dictionaries.

For Further Information: The complete report is available in the major Navy technical
libraries and can be obtained from the Defense Documentation Center. A few copies
are available for distribution by the author.


FOREWORD 


The  issue  of  this  report  marks  the  completion  of  the  second  year  in  which  the 
Office  of  Naval  Research  has  contributed  support  to  the  program  of  research  in  the 
information  sciences  at  the  Palo  Alto  Laboratories  of  the  Lockheed  Missiles  &  Space 
Company. 

It is convenient to think of the work reported here as dealing with a data base.
During the first year of the program, a major part of the effort went into establishment
of the data base. Illustrations of its nature and use are provided by the five
volumes of The English Word Speculum, which was distributed to ONR program
participants during the last year. In this report, examples of exploration and application
of the data base to problems in linguistics and information analysis are given. Special
attention should be given to the report by R. P. Mitchell, which shows how the research
methods developed for written English can be used in an approach to the problems of
synthesis and recognition of spoken English.

One part of the year's work is not reported here. This deals with development
of a technique for obtaining index phrases in English from untranslated Russian
technical texts. This work is described in a separate report.

The group at Lockheed takes this opportunity to express its thanks for the continued
support and encouragement given by the members of the Information Sciences Branch
of the Office of Naval Research.


B.  D.  Rudin 
Principal Investigator


ABSTRACTS


1.  STRUCTURAL  DEFINITION  OF  AFFIXES  FROM  MULTISYLLABLE  WORDS 


In July 1964, H. L. Resnikoff and J. L. Dolby presented a paper at the
Bloomington meeting of the AMTCL entitled "The Nature of Affixing in Written English."
In that paper, an algorithm for the structural definition of affixes was developed and
applied to data consisting of all the words of the form CVCVC in the Shorter Oxford
English Dictionary. Fourteen strong prefixes and twelve strong suffixes, seven weak
prefixes and forty weak suffixes were defined, but it was noted that all the affixes
could not be expected to show up in two-vowel-string words. This paper summarizes
the results of applying a modified form of the operational definition to data consisting
of all the four-, five-, six-, and seven-vowel-string words in Webster's Third New
International Dictionary of the English Language. Thirteen additional weak suffixes,
nineteen weak prefixes, seventeen strong prefixes, one strong suffix, and twelve
possible suffix-compounding elements were found.


2. PART-OF-SPEECH IMPLICATIONS OF AFFIXES


This paper describes a systematic investigation of the extent to which the part-of-speech
of words can be identified from their prefixes and suffixes. The results indicate
that it is possible to determine, with 95 percent accuracy, the inclusive part of speech
of an affixed word from a consideration of its prefixes, suffixes, and length.
By "inclusive" parts-of-speech we mean a string which will include all of the parts-of-speech
assigned by both dictionaries considered, but which may include one or two
extraneous parts-of-speech. The extra parts-of-speech will differ according to the
class of words, as adjectives may have an extra part-of-speech "noun" or "adverb,"
while nouns may have an extra part-of-speech "verb." The part-of-speech implications
of 72 prefixes and of 87 suffixes are given.

3.  ON  THE  INFLECTION  OF  WRITTEN  ENGLISH  ADJECTIVES 

The inflection of adjectives in English is investigated on the basis of the number
of admissible vowel strings contained in a given word. Two types of comparison are
distinguished: the terminational and the analytic. The investigation has determined a
direct relationship between the inflection of adjectives (a given set of adjectives of a
certain graphemically defined type) and an easily observed structural property, and
the following claim is made:

1. The standard adjectives in W (the set of one-vowel-string words which do
not end with the sequence consonant-le is denoted by W) which are not
standard adverbs are inflected analytically, i.e., by using more and most.

2. The standard adjectives in W which are also standard adverbs are inflected
terminationally, i.e., by using the suffixes -er and -est.

In view of the discussion of the relation of the traditional parts-of-speech classes
to structural properties of English, the asserted claim takes on a special importance.
It asserts that the set of adjectives of a certain graphemically defined type (namely,
those that belong to W) can be partitioned into two classes, one containing the
analytically inflected adjectives, and the other containing the terminationally inflected
adjectives. This partition can be determined solely from the parts-of-speech classes
to which the adjectives belong.


4.  AUTOMATIC  DETERMINATION  OF  PARTS  OF  SPEECH  OF  ENGLISH  WORDS 

A procedure for automatically assigning part-of-speech characteristics to English
words is discussed in this paper. The development of the algorithm is traced, and
the algorithm itself is described in terms of three basic graphemic rules, which are
used in conjunction with a group of affixes (fewer than 200) and a list of exception words
(fewer than 800) whose part of speech must be stored in the computer for direct look-up.
The results of a test of the algorithm on a 500-word random sample from a 73,582-word
dictionary are given. Ninety-five percent of the samples are assigned the correct
"inclusive" part-of-speech string, where the inclusive string is defined as including
all parts of speech given in the dictionary, but which may include one, or rarely two,
extra parts of speech.


5. A SYNTACTIC-STATISTICAL METHOD FOR AUTOMATIC INDEXING

The objectives and method of an algorithm developed for automatic indexing are
briefly presented. The reduction level achieved by the algorithm is indicated and the
results of tests on five text excerpts are shown.


6. STATISTICS OF OPERATIONALLY DEFINED HOMONYMS
OF ELEMENTARY WORDS


This computerized study of the homonyms of elementary words (roughly equivalent
to monosyllabic words) has allowed the compilation of exhaustive lists of homonym
sets, using phonetic transcriptions from five different dictionaries. Of the 5,700
elementary words, 3,000 were involved in at least one homonym set, indicating that
homonyms will present a significant problem in mechanized word recognition. The
effects on the homonym sets of changing from the phonetic transcription of one
dictionary to another were tabulated, as were the effects of removing dialectal
pronunciations. Since the effects of dialectal variations turned out to be relatively small, it
was possible to categorize and list for study the actual words whose dialectal
pronunciations caused homonym-type confusion with other words.
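The compilation of homonym sets described in this abstract can be illustrated with a minimal sketch: words are grouped by their transcription, and every group with more than one member is a homonym set. The transcriptions below are hypothetical, written in an ad hoc notation purely for illustration; they are not taken from any of the five dictionaries, and the function name is an assumption.

```python
from collections import defaultdict

def homonym_sets(transcriptions):
    """Group written words that share one phonetic transcription.

    `transcriptions` maps a written word to a transcription string.
    Only groups with two or more members (the homonym sets) are returned.
    """
    groups = defaultdict(list)
    for word, phonetic in transcriptions.items():
        groups[phonetic].append(word)
    return {p: sorted(ws) for p, ws in groups.items() if len(ws) > 1}

# Hypothetical transcriptions in an ad hoc notation (illustration only).
sample = {
    "pair": "per", "pear": "per", "pare": "per",
    "sea": "si", "see": "si",
    "dog": "dog",
}

sets_found = homonym_sets(sample)
# Number of words involved in at least one homonym set.
involved = sum(len(ws) for ws in sets_found.values())
```

Running the same grouping once per dictionary, or with dialectal entries removed, would give the kind of comparison between transcription sources that the abstract mentions.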


7. ACOUSTIC PHONETIC TRANSCRIPTION OF WRITTEN ENGLISH

The function that maps the words of written English onto the corresponding words
of spoken English is described. The simplest hypothesis is that the function $F$,
defined on the symbols forming the letters of the alphabet, maps each letter onto a
sound and maps the sequence of letters $L_1 L_2$ as $F(L_1 L_2) = F(L_1) F(L_2)$. This
hypothesis is false, since $F$ is not always well defined in the sense that its values are
not always unique and the equation does not always persist. On the basis of an exhaustive
dictionary search, we have shown that it is possible to define $F$ in a context-dependent
manner such that its restriction to consonant strings is uniquely defined.
With this new definition, the equation holds for consonant strings of the grammatically
homogeneous one-vowel-string words of written English; the consonant strings in these
words coincide with those uninfluenced by the rules of euphonic combination.

8. COMPUTER STUDY OF TRANSCRIBED ENGLISH PHONETICS:
A PROGRESS REPORT

A summary of a computer-oriented study of the relations between orthographic and
transcribed phonetic forms of elementary English words is presented. The principles
used to generate transcribed phonetic data from the orthography of elementary words
are described. The computer programs which embody these principles were used to
accurately obtain phonetic data contained in five authoritative sources, yielding
phonetic transcriptions for several recognized dialects of English.

These data were analyzed and several important results were obtained. A brief
summary of these results is presented, among them (a) the significance of homonyms,
(b) predictable dependencies of vowel values upon consonant values, and (c) the
isolation of phonetic segments which are independent of transcriptions in which they may
occur.


INTRODUCTION


During  the  past  year,  experiments  in  automatic  indexing  have  proceeded  in  four 
areas,  three  of  which  are  covered  in  this  report: 

(1)  Application  of  English-word  morphology  to  automatic  indexing  and  extracting 

(2)  Automatic  indexing  using  combined  syntactic  and  entropy  selection  criteria 

(3)  Studies  in  phonetic  English 

Each  section  comprises  papers  describing  specific  efforts  within  that  area.  The  first 
phase of the fourth area of research, direct Russian-to-English indexing, has been
completed  and  is  documented  in  a  separate  volume  of  this  report,  Part  II. 

The first section of this report is a continuation of the last annual report. As that
report indicated, only 66 percent of the 300 words were assigned a correct part-of-speech
string by the part-of-speech algorithm, whereas 95 percent accuracy is desired
for the indexing experiment. Therefore a more complete study of the two crucial
factors in the algorithm was undertaken:

•  The  determination  of  affix  sequences  and  of  their  part-of-speech  implications 
(Papers  1  and  2) 

•  Exceptions  to  the  basic  theoretical  premise  that  words  with  one-syllable  kernels 
have  parts  of  speech  noun,  adjective,  and  verb,  while  those  with  multisyllable 
kernels  have  parts  of  speech  noun  and  adjective  (Paper  3) 

These studies indicated that 95 percent accuracy is not obtainable from considerations
of vowel-string and word affixing. Accordingly, the goal of 95 percent accuracy
was dropped, at least temporarily, in favor of a more obtainable goal of 95 percent
"inclusive accuracy," wherein the string assigned to a word includes all parts of speech
given by the dictionaries, but which may include one, or rarely two, extraneous parts
of speech. This decision was based on the judgment that in most utilizations of

LOCKHEED  MISSILES  &  SPACE  COMPANY 


parts-of-speech information, it is easier to eliminate or compensate for extra parts
of speech than to infer the existence of, or compensate for, missing parts of speech.
The goal of 95 percent inclusive accuracy has been reached on the 500-word random
sample from the dictionary. The modified part-of-speech algorithm can now be
recoded for the IBM 360 Model 30 now available to the research laboratory, and
coding of other programs necessary to the compilation of a sentence dictionary can
proceed. The sentence dictionary will be used to investigate the relationship between
syntax and "indexible" sentences, as described in the first annual report. It is also
expected to play a more general role in the study of syntax, analogous to that played
by word dictionaries in the study of morphology. It will provide an ordered list of
observed sentence constructions which will be useful both in the derivation and testing
of syntactic algorithms.
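The "inclusive accuracy" criterion adopted above can be stated compactly in code. The following is a minimal sketch, assuming that an assignment counts as correct when it contains every dictionary part of speech and at most two extraneous ones; the function names and the sample data are hypothetical.

```python
def is_inclusive(assigned, dictionary, max_extra=2):
    """True if `assigned` covers every dictionary part of speech
    and adds no more than `max_extra` extraneous ones."""
    assigned, dictionary = set(assigned), set(dictionary)
    return dictionary <= assigned and len(assigned - dictionary) <= max_extra

def inclusive_accuracy(pairs, max_extra=2):
    """Fraction of (assigned, dictionary) pairs that are inclusive."""
    hits = sum(is_inclusive(a, d, max_extra) for a, d in pairs)
    return hits / len(pairs)

# Hypothetical sample: (assigned string, dictionary string).
sample = [
    ({"noun", "adjective"}, {"adjective"}),          # one extra: accepted
    ({"noun"}, {"noun", "verb"}),                     # verb missing: rejected
    ({"noun", "verb"}, {"noun", "verb"}),             # exact: accepted
    ({"adjective", "adverb", "noun", "verb"}, {"adjective"}),  # three extras: rejected
]
score = inclusive_accuracy(sample)
```

The asymmetry in `is_inclusive` reflects the judgment stated in the text: an extra part of speech is tolerated, a missing one is not.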

The second section gives the results to date of an evolving algorithm for combining
syntactic and entropy criteria in the automatic selection of index items and phrases.
In this experiment, a parsing program is used to select the syntactic units upon which
the entropy criteria are then imposed. Because of a change in computers, it has been
possible to complete tests on only five text fragments. Further refinements and tests
must await the recoding of the programs for the IBM 360 Model 30 now in use by the
Research Laboratory.

The third section describes progress in the investigation of relationships between
the phonetic and graphemic forms of English words. Such studies are expected to make
it practicable to use human speech as input to and output from a computer programmed
for any desired automatic processing of language.


I

APPLICATION  OF  ENGLISH  WORD  MORPHOLOGY 
TO  AUTOMATIC  INDEXING  AND  EXTRACTING 


1.  STRUCTURAL  DEFINITION  OF  AFFIXES  FROM  MULTISYLLABLE  WORDS* 

L.  L.  Earl 

In this paper the goal is to define affixes from structural criteria alone. The
problem of when an affix sequence is genuinely acting as an affix (as re may be
considered a prefix in react but not in read) will not be considered, although the
categorization into strong and weak affixes is intended to anticipate this problem. The
validity of the defined affixes will be indicated only by comparison with existent affix
lists. A more utilitarian evaluation of affix validity can be made after the syntactic
and phonetic implications of the defined affixes have been investigated.

The definitions for affixes given in this paper are essentially unchanged from the
definitions given by Dolby and Resnikoff (Reference 1), but are extended to include both one- and
two-syllable affixes. The data set to which these definitions are applied is the four-,
five-, six-, and seven-vowel-string words, a set of about 11,250 words. From this
set the one-vowel-string affixes which did not occur in the two-vowel-string data set
(used in Reference 1) will be defined, along with the two-vowel-string affixes which
could not have occurred in the two-vowel-string data.
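The data sets above are selected by the number of vowel strings in a word. A minimal sketch of that count follows, assuming that a vowel string is a maximal run of vowel letters and that y is counted as a vowel; the report's own treatment of y and of final e may differ, and the function names are hypothetical.

```python
import re

VOWELS = "aeiouy"  # assumption: y is treated as a vowel letter

def vowel_strings(word):
    """Return the maximal runs of vowel letters in `word`, left to right."""
    return re.findall(f"[{VOWELS}]+", word.lower())

def vowel_string_count(word):
    """Number of vowel strings in `word`."""
    return len(vowel_strings(word))

def in_data_set(word, low=4, high=7):
    """True for the four- to seven-vowel-string words used as data here."""
    return low <= vowel_string_count(word) <= high

examples = {w: vowel_string_count(w)
            for w in ("plantation", "romanticism", "international")}
```

Under these assumptions "plantation" has three vowel strings (a, a, io), so it falls outside the four- to seven-vowel-string set, while "romanticism" and "international" fall inside it.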

The extended definition for strong prefixes can be summarized as follows (consonant
strings referred to in the definition are given in Table 1-1): Given a word of the
form $C_1 V_1 C_2 V_2 C_3 V_3 \ldots$, if either $C_2$ or $C_3$ is an inadmissible consonant string,
there is a mandatory syllable break within the string, and everything preceding that
break is defined as a prefix possibility. A prefix possibility is defined as a prefix
probability if in the data there are at least four words with the same prefix possibility
*This work was supported by the Office of Naval Research and by the Independent
Research Program of Lockheed Missiles & Space Company.

Table 1-1

arising from the same consonant string. A prefix probability becomes a strong prefix
if the same prefix probability arises from two or more inadmissible consonant strings.
The definition for strong suffixes is analogous, proceeding from the other end of the
word. Thus, given a word of the form $\ldots V_3 C_3 V_2 C_2 V_1 C_1$, if either $C_2$ or $C_3$ is an
inadmissible string, there is a mandatory syllabic break within the string, and everything
following that break is defined to be a suffix possibility. Then the definition for
suffix probability and for strong suffix is the same as for prefixes; the word suffix
can be substituted for the word prefix wherever it occurs. The consonant string $C_1$
may be blank in either case. The criterion of four or more words in establishing an
affix probability, and the criterion of two or more consonant strings in defining an
affix from a probability, were established in Reference 1. These criteria were established
heuristically, and have been retained here not only for the sake of consistency
but also because they were proven effective.
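The strong-prefix criteria can be sketched roughly as follows. This is an illustration under stated assumptions, not the program used in the study: the admissible-string sets are small hypothetical stand-ins for Table 1-1, a consonant string is treated as inadmissible when it appears in neither set, the letter y is counted as a vowel, and the break inside an inadmissible string is assumed to fall before its longest admissible initial tail (the text does not state where the break falls).

```python
import re
from collections import defaultdict

# Hypothetical stand-ins for Table 1-1; the real lists are longer.
ADMISSIBLE_INITIAL = {"b", "c", "d", "g", "l", "m", "n", "p", "r", "s", "t", "v",
                      "bl", "br", "gl", "gr", "pl", "pr", "tr"}
ADMISSIBLE_FINAL = {"b", "d", "g", "l", "m", "n", "p", "r", "s", "t",
                    "mp", "nd", "nt", "rt"}

def segments(word):
    """Split a word into alternating consonant and vowel strings, C1 first."""
    parts = re.findall(r"[aeiouy]+|[^aeiouy]+", word)
    if parts and parts[0][0] in "aeiouy":
        parts.insert(0, "")  # C1 may be blank
    return parts

def prefix_possibilities(word):
    """Yield (prefix, consonant_string) for each inadmissible C2 or C3."""
    parts = segments(word)
    for idx in (2, 4):  # positions of C2 and C3 in C1 V1 C2 V2 C3 V3 ...
        if idx + 1 >= len(parts):  # the string must be followed by a vowel string
            break
        cons = parts[idx]
        if cons in ADMISSIBLE_INITIAL or cons in ADMISSIBLE_FINAL:
            continue
        # Assumed break rule: cut before the longest admissible initial tail.
        for cut in range(1, len(cons)):
            if cons[cut:] in ADMISSIBLE_INITIAL:
                yield "".join(parts[:idx]) + cons[:cut], cons
                break

def strong_prefixes(words, min_words=4, min_strings=2):
    """Prefixes seen with at least `min_strings` inadmissible consonant
    strings, each supported by at least `min_words` words."""
    counts = defaultdict(lambda: defaultdict(int))
    for word in words:
        for prefix, cons in prefix_possibilities(word.lower()):
            counts[prefix][cons] += 1
    return {prefix for prefix, by_string in counts.items()
            if sum(n >= min_words for n in by_string.values()) >= min_strings}

toy_words = ["impropriety", "improvisation", "imbroglio", "imbrication"]
toy_result = strong_prefixes(toy_words, min_words=2)
```

With the thresholds of the text (four words, two consonant strings) the four toy words are too few to define anything; lowering `min_words` to 2 lets the sequence im qualify through the strings mpr and mbr, which mirrors how im appears among the strong prefixes.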

The definition for weak affixes has also been extended to include two-syllable
affixes. Weak affixes are so classified because their definition is based on a probable
syllabic break rather than a mandatory break. Because such probable breaks are not
interior to a consonant string, weak prefixes end with a vowel and weak suffixes begin
with a vowel. For prefixes, given a word of the form $C_1 V_1 C_2 V_2 C_3 V_3 \ldots$, if either
$C_2$ or $C_3$ is an admissible initial string but not an admissible final string, everything
preceding that consonant string is a prefix possibility. For suffixes, given a word of
the form $\ldots V_3 C_3 V_2 C_2 V_1 C_1$, if either $C_2$ or $C_3$ is an admissible final string but not
an admissible initial string, everything following that consonant string is a suffix
possibility. The criteria by which an affix possibility becomes an affix are the same as for
strong affixes. Note that these definitions exclude admissible final strings from $C_2$
or $C_3$ for prefixes, and admissible initial strings from $C_2$ or $C_3$ for suffixes, to
increase the reliability of the definition by reducing the probability of postulating a
break before (for prefixes) or after (for suffixes) $C_2$ or $C_3$ where a break does not
exist. Consider the prefix case first. If $C_2$ or $C_3$ is an admissible initial string,
and also an admissible ending string, the syllabic break could be logically either
before or after the string. The string CH is such a string, as the following words
illustrate.

enrich/ment     ta/chometer

poach/er        re/christen
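The cases distinguished so far (a mandatory break inside an inadmissible string, a probable break before or after a string, and a doubtful string such as CH) can be summarized in a small classifier. The two sets below are hypothetical fragments chosen only to exercise each case; they are not the lists of Table 1-1, and the function name is an assumption.

```python
# Hypothetical fragments of the admissible-string lists (illustration only).
INITIAL = {"j", "s", "v", "br", "gr", "ch", "t"}
FINAL = {"ch", "t", "ct", "mb", "rt"}

def break_kind(cons):
    """Classify an interior consonant string by the break it suggests."""
    initial, final = cons in INITIAL, cons in FINAL
    if initial and final:
        return "ambiguous"               # e.g. CH: break before or after
    if initial:
        return "probable break before"   # basis for weak prefixes
    if final:
        return "probable break after"    # basis for weak suffixes
    return "mandatory break within"      # basis for strong affixes

kinds = {c: break_kind(c) for c in ("ch", "br", "ct", "mbr")}
```

Strings classified as ambiguous are the ones the weak-affix definitions exclude.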

By  eliminating  such  doubtful  strings  we  should  increase  somewhat  the  reliability 
of  the  definition  of  our  prefix  possibilities,  but  we  do  not  completely  eliminate  chance 
for  error,  because  even  with  initial  strings  that  are  not  also  final  strings,  a  break 
may  occur  internal  to  a  multiletter  string  or  after  a  single  letter  string.  The  strings 
BR  and  GR  are  such  multiletter  strings,  as  the  following  words  illustrate. 

sub/routine     ag/riculture

re/broadcast    de/gree

The chance of this happening in two multiletter strings with the same prefix
possibility is judged small enough to be discounted, since here we are simply defining prefix
sequences. The chance of error due to a break after a single letter seems greater,
as with the letter S.

re/sidual 

res/ident 

However, since there are only three single consonants which are beginning but not
ending strings (J, S, V), and since again it takes two consonant strings to cause a
sequence to be defined as an affix, this problem, too, can be discounted.

It is suspected that the situation for suffixes is more difficult in that the set of
terminal consonant strings left after removing initial strings has more members which
show a tendency to break internally. For example, breaks in the following strings are
common.


c/t as in lac/tate
m/b as in am/bition
r/t as in fer/tile
m/p as in am/pere
p/t as in ap/titude
r/l as in pur/loin
r/b as in ar/bor
n/d as in ban/dit

Therefore, more difficulty in determining when a defined weak suffix is actually
acting as a suffix in a given word could reasonably be anticipated. It would be interesting
to subject each of the weak suffixes to a qualifying test, namely that in the two-syllable
data set there not be two sets of illegal strings preceding the suffix, where each had at
least four members. When this test was applied to the five suffixes a, age, ah, ent,
ock, two of the suffixes, a and ock, failed the test. But both a and ock obviously sometimes
act as suffixes (they are both listed in the dictionaries as such), so it is unwise
to eliminate them at this point in the research. What is indicated, perhaps, is the
structural classification of the weak suffixes by degree of weakness, as a means of
approaching the suffix-in-context problem.

Table 1-2 reviews the prefixes and suffixes defined in Reference 1, using the
two-vowel-string words as the data set. Table 1-3 shows the new suffixes defined using
four-, five-, six-, and seven-vowel-string words, with the preceding letter strings and
occurrence counts which established them as suffixes. Surprisingly, there is only
one that can be considered a strong suffix, and that actually turned up as the weak
suffix ation. Since all of the preceding letter strings turned out to be of the form Ct
(where C = c, l, n, or r), and since phonetic breaks were consistently before the t
Table 1-2

AFFIXES FROM TWO-VOWEL-STRING WORDS

Strong Prefixes        Strong Suffixes

ac      in             ful     ly
ad      mis            land    lock
al      out            ler     man
con     sub            less    ment
dis     sun            let     ness
en      trans          ling    ward
ex      un

Weak Prefixes          Weak Suffixes

a                      a       in      con
be                     age     ine     ue
cy                     ah      ing     er
de                     al      ion     um
e                      an      is      el
i                      ant     ish     ure
re                     ar      ite     ic
                       ard     ive     us
                       at      o       ic
                       ed      ock     ier
                       ee      on      ile
                       el      or
                       en      ol
                       ent     ow

Table 1-3


(as in plantation), it seemed reasonable to consider tation a strong suffix. Of the
thirteen newly defined suffixes, able, ial, ate, ist, ism, y, ous, ian, ium, ia, and ide
are all commonly recognized as such, while only tation (or ation) and is are not.

It was expected that more than one two-vowel-string suffix would materialize.
Instead, a number of sequences were observed which appear to act as inner suffixes,
or suffix compounding elements, which occur frequently in combination with
one-syllable suffixes. Thus, the sequence tic is frequently encountered followed by al, ism, ize,
or ide to form tical, ticism, ticize, ticide as in elliptical, asepticism, didacticism,
asepticize, romanticize, and infanticide. Such interior sequences which meet the
occurrence criteria set up for suffixes are listed in Table 1-4. It is expected that
these sequences will have little syntactic meaning but may be helpful in word
hyphenation techniques.

Table 1-5 shows the prefixes defined using four-, five-, six-, and seven-vowel-string
words, with the following letter strings and occurrence counts which established
them as prefixes. The three newly defined strong two-syllable prefixes circum, inter,
and hyper are well known. Three other common prefixes, over, under, and super,
were encountered with a good many letter strings, but always failed to meet the
requirement of more than three occurrences with a given letter string.

Of the strong one-syllable prefixes defined, ab, at, ap, com, an, em, im, and oc
are recognized by dictionaries, while vul is not. Of the weak two-syllable prefixes,
auto, demo, iso, photo, epi, and tele are commonly recognized, but ana, apo, deni,
and irre* are not. None of the one-syllable weak prefixes (au, ca, hy, ma, mi, lu,
pro, sa, su, vi) are familiar as meaningful prefixes except for pro. Therefore, the
next step, in which the part of speech implications of the structurally defined affixes
*irre is no doubt a combination of the recognized prefixes i and re.

Table 1-4

ELEMENTS COMBINING WITH SUFFIXES

Suffix Compounding    Terminal Letter       No. of
Element               String Associated     Occurrences

-cat-                 rc                    9
                      nc                    12
-mat-                 rm                    22
                      mm                    18
-pos-                 mp                    8
                      ?p                    6
-pat-                 lp                    6
                      rp                    6
-sit-                 ns                    8
                      rs                    5
                      ss                    5
-sat-                 ns                    12
                      rs                    5
                      ss                    16
-tat-                 lt                    16
                      nt                    46
                      rt                    11
-tur-                 ct                    6
                      et                    19
                      nt                    8
-tic-                 ct                    13
                      nt                    7
                      pt                    13
-tor-                 ct                    33
                      nt                    6
-ter-                 ct                    8
                      ct                    9
                      nt                    8
                      pt                    44
-tin-                 nt                    6
                      rt                    6

Table 1-5 (Cont.)

                      Following             No. of Occurrences of Prefix
Prefix                Letter String         Preceding Given Letter String

mi                    cr                    69
                      thr                   4
pro                   si                    6
                      pr                    4
sa                    cr                    8
                      pr                    5
su                    bl                    6
                      pr                    11
                      sc                    5
vi                    hr                    8
                      sc                    5
                      tr                    4

Strong Prefixes

                      Defining              No. of Occurrences of Prefix
Prefix                Letter String         With Given Letter String

at                    tm                    15
                      ttr                   11
ap                    ppl                   15
                      ppr                   46
an                    mlr                   18
                      ngl                   9
                      nh                    6
                      nth                   ?
                      nthr                  35
em                    mbl                   13
                      mbr                   30
im                    mbr                   ?
                      ?                     31
                      mpr                   66
com                   mpl                   38
                      mpr                   13
vul                   le                    6
                      ln                    ?




are investigated, will be especially interesting for this group. It is, in fact, in the next
step, in which the various applications and implications of the structurally defined
affixes are investigated, that the utility and therefore the validity of these structural
definitions will be tested.

ACKNOWLEDGMENT:  The  author  wishes  to  thank  Dan  L.  Smith  for  writing  many  of 
the computer programs used in deriving the affixes.


2. PART-OF-SPEECH IMPLICATIONS OF AFFIXES*


L.  L.  Earl 

In a highly inflected language, the structure of a word is indicative of its syntactic
role. A relationship between form and part-of-speech might also be expected in English,
a language not highly inflected but closely related to more inflected languages. Such a
relationship was noted by J. Dolby and H. Resnikoff (Reference 1), who show that a high percentage
of a set of words called "elementary words" (roughly equivalent to the set of one-syllable
words) can be used as nouns, adjectives, or verbs, while a high percentage of the
remaining multisyllable words can be used only as nouns or adjectives. If this relationship
can be regarded as a general rule, and if subrules can be developed to cover the
considerable number of exceptions to the general rule, it will be possible to identify
part-of-speech by algorithm. Intuitively, it would be expected that prefixes and suffixes
are key structural elements; this expectation is reinforced by the structure of the
European languages whose beginnings and endings indicate the grammatical properties
of words.

A logical step in an effort to classify words from their structure is to examine the
relationship between the affixes of words and their part-of-speech possibilities as
listed in a dictionary. The part-of-speech information from The Shorter Oxford
Dictionary (Ref. 2) and from the Merriam Webster New International Dictionary (Ref. 3)
was recorded on magnetic tape. A computer was used to correlate the affixes of words
with their part-of-speech possibilities. A total of 73,582 words was recorded, but, of
course, not all of these words contain affixes.

*This work was supported in part by the Office of Naval Research; the computer time
was supported by the Independent Research Program of Lockheed Missiles & Space
Company.

The first problem encountered is that of selecting a list of affixes. Two sets of
affixes have been selected, the first being the operationally defined affixes derived from
dictionaries solely on graphemic evidence (Refs. 4, 5), and the second being all "beginnings
or endings" listed in A Dictionary of Modern English Usage (Ref. 6) which were not already
on the first list. Both lists are given in Table 2-1. The inflectional suffixes ed and ing and
the adverbial ly were not considered in this study because they have well recognized
implications. It is believed that the number of words ending in ed, ing, or ly whose
parts-of-speech differ from the expected is small enough that such words can be
listed as exceptions.

The second problem encountered is that of determining when an affixing unit is
acting as an affix in a given word, as re is a prefix in react but not in read. This
problem is complicated by an uncertainty as to what the words prefix and suffix signify.
It is difficult to determine from the definitions currently in use to what unit an affix is
expected to attach (word, stem, or syllable), to what extent the function of an affix is
semantic, and to what extent the affix should indicate phonetic syllabic boundaries
(as pre indicates syllabic boundaries in prefix but not in preface). Since we hope to
use affixes in determining part-of-speech from form alone, we will use a formal
definition. For purposes of this study, an affix will be recognized as an affix under
only two formal and reproducible conditions. First, the unit to which any affix attaches
must contain one or more vowel strings. Second, the unit to which any prefix attaches
must begin with an admissible initial consonant string, and the unit to which any suffix
attaches must end with an admissible terminal consonant string. The admissible
initial and terminal strings, whose derivation is given in Reference 1, are listed in
Table 2-2. Refinements of these rules are possible, to produce a closer correspondence
with any given definition, but these criteria seem adequate for our purposes.
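
The two formal conditions above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sets INITIAL_STRINGS and FINAL_STRINGS are small subsets of the strings legible in Table 2-2, the letter y is treated as a consonant, and the remaining unit is required to begin (or end) with a non-empty consonant string. The report does not say how units that begin or end with a vowel were handled, so that last choice is a literal reading, not the authors' stated rule. All function names are hypothetical.

```python
VOWELS = set("aeiou")  # assumption: y is treated as a consonant

# Assumption: partial subsets of Table 2-2, for illustration only.
INITIAL_STRINGS = {"b", "c", "d", "f", "g", "h", "l", "m", "n", "p", "r", "s", "t",
                   "bl", "br", "ch", "cl", "cr", "fl", "fr", "gr", "pl", "pr",
                   "sh", "sp", "st", "th", "tr", "scr", "spr", "str", "thr"}
FINAL_STRINGS = {"b", "c", "d", "f", "g", "ck", "ct", "ff", "mp", "nd", "ng",
                 "nk", "nt", "sh", "sk", "sp", "ss", "ght"}


def has_vowel_string(unit):
    """Condition 1: the unit contains at least one vowel string."""
    return any(ch in VOWELS for ch in unit)


def leading_consonants(unit):
    """Consonant string at the start of the unit (may be empty)."""
    i = 0
    while i < len(unit) and unit[i] not in VOWELS:
        i += 1
    return unit[:i]


def trailing_consonants(unit):
    """Consonant string at the end of the unit (may be empty)."""
    i = len(unit)
    while i > 0 and unit[i - 1] not in VOWELS:
        i -= 1
    return unit[i:]


def prefix_acts_as_affix(word, prefix):
    """True if the remainder after the prefix meets both conditions."""
    if not word.startswith(prefix) or len(word) <= len(prefix):
        return False
    unit = word[len(prefix):]
    return has_vowel_string(unit) and leading_consonants(unit) in INITIAL_STRINGS


def suffix_acts_as_affix(word, suffix):
    """True if the remainder before the suffix meets both conditions."""
    if not word.endswith(suffix) or len(word) <= len(suffix):
        return False
    unit = word[:-len(suffix)]
    return has_vowel_string(unit) and trailing_consonants(unit) in FINAL_STRINGS
```

Under these assumptions, inter is accepted as a prefix in "interlock" (the remainder "lock" has a vowel string and begins with L) but not in "intern" (the remainder "n" has no vowel string), and less is accepted as a suffix in "endless" but not in "bless".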

Table 2-1

AFFIXES SELECTED FOR CORRELATION

Affixes Set I

    Prefixes                              Suffixes

    a        dig      (illegible)         a      ia     lock
    ab       e        out                 ah     ic     man
    ac       ec       photo               ai     ie     ment
    ad       em       pro                 an     in     ness
    al       en       re                  ar     is     o
    an       epi      sa                  at     ial    on
    ap       ex       sac                 age    ier    or
    at       hy       sub                 ant    tie    ot
    ana      hyper    sun                 ard    ine    ow
    a|ie     i        tele                ate    ion    ock
    auto     im       trans               able   ish    talion
    be       in       un                  ec     ism    ue
    ca       incon    uncon               el     isl    um
    circum   inex     vi                  en     ite    us
    com      inter    vul                 er     ium    ure
    con      irre                         cl     ive    ward
    cy       lu                           ey     ier    y
    de       ma                           ent    let
    demo     mi                           eon    land
    deni     mis                          ful    less

Affixes Set II

    Prefixes              Suffixes

    air      for          ae      ise     ty
    aero     fore         ai      1st     ular
    bi       hecto        as      ity     (illegible)
    by       homo         cy      ize     ways
    bye      non          ex      ible    worthy
    brain    para         eer     iana
    co       self         cm      lily
    centi    semi         rsl     lofty
    deca     super        cite    latry
    deci     vice         genic   phile
    demi     yester       ix      Hi

Table 2-2

INITIAL AND TERMINAL STRINGS

ADMISSIBLE INITIAL CONSONANT STRINGS OF CVC WORDS

    B    N    BL    GL    SH    TR    SCH
    C    P    BR    GN    SK    TW    SCR
    D    Q    CH    GR    SL    WH    SHR
    F    R    CL    KN    SM    WR    SPH
    G    S    CR    KR    SN          SPL
    H    T    DR    PH    SP          SPR
    J    V    DW    PL    SQ          STR
    K    W    FL    PR    ST          THR
    L    Z    FR    RH    SW          THW
    M         GH    SC    TH

ADMISSIBLE FINAL CONSONANT STRINGS OF CVC WORDS NOT ENDING WITH E

    B    BB    MP    SH    GHT
    C    CH    ND    SK    LCH
    D    CK    NG    SM    LPH
    F    CT    NK    SP    LTH
    G    DD    NN    SS    MPH
    H    FF    NT

To correlate the affixes in Table 2-1 with parts of speech, a computer program was
written to examine all double standard* words with two or more vowel strings. It sorted
out all words that had an affix, that is, a beginning or ending that matched a member
of the affix list and met the established criteria. Each of these words had a
part-of-speech** string given for it, that is, the list of parts-of-speech possible for that word.
Since the dictionaries do not always agree, the string is taken as the parts-of-speech
that are associated with standard meanings of the word in either dictionary. The
program associated the part-of-speech string of a given word with that word's prefix
or suffix. Up to nine different strings could be associated with an affix. For each
affix, a count of the number of words with that affix was made for each encountered
part-of-speech string, with the counts divided according to the number of syllables in
the words. The following example will help to clarify.

The result for the prefix inter is shown in Fig. 2-1. A mark in a column indicates
presence in the dictionary of the part-of-speech identified by the abbreviation at the
head of the column. Thus, the first line of Fig. 2-1 indicates that the first part-of-speech
string encountered in the words prefixed with inter was noun and verb, and that there were
23 total words with this part-of-speech string, one of them a two-vowel-string word
and 22 of them three-vowel-string words. The next line shows that there were three
total words with the string noun, adjective, and verb, one of them a two-vowel-string
word and two of them three-vowel-string words. This continues until the tenth line,
which indicates that more than nine part-of-speech strings had been encountered, at

*To avoid the complication of considering archaic or little-used words, only words
having a standard meaning in both dictionaries were used.

**The parts of speech recorded on tape are as follows: noun (N), adjective (AJ), verb
(V), adverb (AV), preposition (PR), conjunction (CJ), pronoun (PN), interjection (IJ),
past verb (PV). The category other (OT) was used whenever the dictionary gave some
part of speech other than the nine listed; OT comprises mainly participles and
collective nouns.

which point the program terminated the examination of this affix. Note that the column
headed TOT shows the distribution according to part-of-speech of all words prefixed
with inter and the columns headed Nvs show the distribution according to part-of-speech
of words with N vowel strings. The distribution according to vowel strings was
obtained because it had been noted that there was a general tendency for the percentage
of noun-adjective words to increase with the number of syllables.

Fig. 2-1 Example of Affix Statistics Output by the Computer Program
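
The tabulation just described can be sketched as follows. This is a minimal illustration, not the original program: the function names, the tuple layout of the input, and the toy entries are all assumptions made here. It keeps at most nine distinct part-of-speech strings per affix and stops examining an affix when a tenth string turns up, as the text describes.

```python
from collections import defaultdict

MAX_STRINGS = 9  # the report's program kept at most nine strings per affix


def tabulate(entries):
    """Count words per affix, per part-of-speech string, per vowel-string count.

    entries: iterable of (affix, pos_string, n_vowel_strings), where
    pos_string is a tuple such as ("N", "V").
    Returns {affix: {pos_string: {n_vowel_strings: count}}}.
    """
    table = defaultdict(dict)
    closed = set()  # affixes whose examination has been terminated
    for affix, pos_string, n_vs in entries:
        if affix in closed:
            continue
        rows = table[affix]
        if pos_string not in rows:
            if len(rows) == MAX_STRINGS:
                closed.add(affix)  # a tenth distinct string ends the examination
                continue
            rows[pos_string] = defaultdict(int)
        rows[pos_string][n_vs] += 1
    return table


def total(table, affix, pos_string):
    """The TOT column: all words with this affix and part-of-speech string."""
    return sum(table[affix][pos_string].values())


# Toy entries for illustration only; these are not the report's data.
entries = [
    ("inter", ("N", "V"), 3),
    ("inter", ("V",), 3),
    ("inter", ("N", "V"), 2),
    ("inter", ("N",), 3),
]
table = tabulate(entries)
print(total(table, "inter", ("N", "V")))
```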

Study of the part-of-speech distributions of the words with affixes in Set I shows
that the words with a given affix have an average of eight or more part-of-speech
combinations associated with them, and, in general, there is wide distribution of the
words among the different part-of-speech strings. In fact, the results indicate that
it will be impossible to assign a 100 percent unique part-of-speech string to a word
on the basis of its affixes. What should be possible is to establish an algorithm which
will be 95 percent correct in assigning an "inclusive" part-of-speech string, by which
we mean a string which will include all of the dictionary-assigned parts-of-speech,
but which may include some extraneous parts-of-speech.

Since, as already noted, the majority of multisyllabic words can be used only as
nouns or adjectives, this will be the point of departure in deriving a part-of-speech
algorithm. All words which do not behave as nouns, or adjectives, or nouns and
adjectives only are to be considered exceptional, to be listed or to be identified as
exceptional by examination of their affixes. The algorithm will be constructed to identify
the exceptions, leaving the rest to be given the basic assignment of noun-adjective for
multisyllabic words or noun-adjective-verb for one-syllable words.

Because they are manageably few, all adverbs not ending in ly, and all prepositions,
conjunctions, interjections, and irregular past tense verbs can be removed and put in a
special exception list. This leaves combinations of noun, adjective, verb, and "other"
to deal with, where "other" comprises participial forms and collective nouns. Regular
forms of participles can be recognized by the inflectional endings ing or ed, and irregular
forms of participles and collective nouns are few enough that they can be added to the
exception list. (So also can all words which end in ing or ed but are not participial forms.)
Seven possible part-of-speech combinations remain:


    (1)  noun                          N
    (2)  adjective                     AJ
    (3)  noun and adjective            N-AJ
    (4)  verb                          VB
    (5)  noun and verb                 N-VB
    (6)  adjective and verb            AJ-VB
    (7)  noun, adjective, and verb     N-AJ-VB

Since most nouns can be used as adjectives, and since the AJ-VB combination is
uncommon except for participles, which are already taken care of, the seven
combinations can be reduced to four by merging 3 with 1, and 5 and 6 with 7, to give:


    (1)  noun and adjective                      NA
    (2)  adjective                               AJ
    (3)  verb                                    VB
    (4)  verb and (noun and/or adjective)        NAVB

To put it another way, there are two large classes of multisyllable words, NA and NAVB,
which must be distinguished; in addition, the class AJ must be distinguished from the
NA and the class VB from the NAVB. Whenever these distinctions cannot be made with
95 percent accuracy, assignments will be made to the inclusive set.
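
As a rough illustration of the reduction from seven combinations to four classes, the following Python sketch maps a dictionary part-of-speech set to one of NA, AJ, VB, or NAVB. The function name is hypothetical, the abbreviations are those recorded on tape (N, AJ, V), and returning None for a word with no noun, adjective, or verb usage is an assumption made here, since such words belong to the exception list.

```python
def reduced_class(pos):
    """Map a set of parts of speech to NA, AJ, VB, or NAVB.

    Parts of speech other than noun (N), adjective (AJ), and verb (V)
    are ignored, as in the report.
    """
    core = set(pos) & {"N", "AJ", "V"}
    if not core:
        return None  # assumption: such words are handled by the exception list
    if "V" in core:
        # Verb alone stays VB; combinations 5, 6, and 7 merge into NAVB.
        return "VB" if core == {"V"} else "NAVB"
    if core == {"AJ"}:
        return "AJ"
    return "NA"  # noun alone, or noun and adjective (combinations 1 and 3)
```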

The construction of the algorithm thus becomes quite simple, a matter of studying
the distribution of the part-of-speech strings for each affix, ignoring any part-of-speech
other than noun, adjective, or verb. In accordance with the 95 percent criterion, an
affix for which 95 percent of the words with that affix have a single part of speech, either
AJ or VB, will be classified as "adjectival" or "verbal," respectively, and the algorithm
will simply assign words containing such an affix to the AJ or the VB class instead of
to the basic NA class. Affixes for which 95 percent of the words are nouns and/or
adjectives, but not verbs, may be considered as "neutral," since words containing them
behave as nouns and/or adjectives in accordance with the general rule. An affix,
however, for which 5 percent of the words (and more than 5 words) have a verb usage will
be classified "noun-verbal," and words containing such an affix will be assigned to the
NAVB class. As already indicated, all words which do not contain an affix and which
are not in an exception list are classified as NA if multisyllabic and NAVB if one syllable.
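
The classification rule can be sketched in Python as below. The function name and the input layout (a mapping from reduced class to word count) are assumptions made for illustration. The thresholds follow the text; the rule that no assignment is made for fewer than five words is the one the report states for its table of implications. The counts in the usage lines are the figures the report gives for the prefix inter (140 NA words, 27 NAVB, 44 VB).

```python
def affix_implication(class_counts):
    """Classify an affix as 'AJ', 'VB', 'NAVB', or 'NA' (neutral).

    class_counts maps each reduced class ('NA', 'AJ', 'VB', 'NAVB') to the
    number of words with this affix that fall in that class.
    Returns None when there are fewer than five words.
    """
    total = sum(class_counts.values())
    if total < 5:
        return None
    # Adjectival or verbal: 95 percent of the words have that single class.
    for single in ("AJ", "VB"):
        if 100 * class_counts.get(single, 0) >= 95 * total:
            return single
    # Noun-verbal: 5 percent of the words, and more than 5 words, have a verb usage.
    verbal = class_counts.get("VB", 0) + class_counts.get("NAVB", 0)
    if 100 * verbal >= 5 * total and verbal > 5:
        return "NAVB"
    return "NA"


inter_counts = {"NA": 140, "NAVB": 27, "VB": 44}
print(affix_implication(inter_counts))
```

Integer arithmetic is used for the percentage tests so that counts falling exactly on a threshold are not affected by floating-point rounding.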

It must be realized that a good many ambiguities will be introduced by this algorithm.
For example, for words prefixed with inter, 71 of the 211 words in our data set have a
verbal usage, with further breakdown as follows:

    noun and verb                 23
    noun, adjective, and verb      3      NAVB   27
    adjective and verb             1
    verb                          44      VB     44

Accordingly, words beginning with inter will be assigned to the NAVB class, obtaining
the correct inclusive part-of-speech for 71 words at the cost of introducing the extraneous
part-of-speech VB to the 140 well-behaved NA words. The situation is worse in the
ambiguity between the AJ and the NA classes. For example, although about 80 percent of
words ending in the suffix ful are adjectives, 34 out of the total 109 have a noun usage,
so rather than take a 20 percent error of omission, ful is regarded as a neutral suffix
and an extra part-of-speech has been introduced in 80 percent of the words. By
stretching a point, the suffix less can be considered adjectival, since it is 94 percent adjectival,
but many other adjective-tending affixes encountered cannot (ic, 54 percent; able, 79
percent; ish, 70 percent; ial, 61 percent; us, ?7 percent; mis, 61 percent).

A part-of-speech implication of either NAVB, VB, AJ, or neutral (i.e., NA) has
been determined for all of the affixes. These implications are listed in Table 2-3.
When there were fewer than five words with a given affix, no assignment was made.
The implications of the operational affixes and of the Dictionary of Modern English
Usage affixes break down statistically as follows:

[Table not recoverable from the source; surviving labels: Neutral, NAVB, English Usage.]

In Table 2-3, some of the affixes have star superscripts. These are affixes with a
NAVB implication which in words of four or more syllables may be regarded as neutral,
since in the dictionary there were fewer than three 4- to 8-vowel-string words with
these affixes which possessed verbal usages. NAVB affixes which are neutral for 5- to
8-vowel-string words were not considered because there are only about 1,250 of these,
while there are about 11,250 4- to 8-vowel-string words.

There are some words, of course, which have both prefix(es) and suffix(es). As
the part-of-speech tabulations for suffixes were independent of prefixes, and vice versa,
there was a possibility of a particularly influential and common affix introducing an
extra part-of-speech into the part-of-speech counts of other affixes. For example,
suppose that all the words with the prefix "trans" were always nouns except those
which ended in verbal suffixes such as er or ate, as "transfer" and "translate." Then
"trans" would be assigned the implication NAVB when it should have been neutral. To
test this possibility, the Set I prefix counts were repeated with all words having nonneutral
suffixes omitted from the data set. However, the part-of-speech implication of all
prefixes remained the same. Since none of the part-of-speech implications of the prefixes
changed, it was decided that it was unnecessary to test suffixes on a set from which
prefixed words had been removed.


Prefixes were chosen for the test because the suffixes seem to have a stronger
influence than prefixes in multi-affixed words, as for example the neutral ism wins
over the NAVB ex in "exorcism," and the verbal ize wins over the neutral vul in
"vulcanize." Suffixes would thus cause much more of a problem in the prefix counts
than prefixes in the suffix counts. The one easily noted exception to the rule of suffix
ascendancy is for such words as "automation" and "vulcanization," in which the
neutral auto and vul seem to be ascendant over the NAVB ion. However, a consideration
of other words in which both prefix and suffix are NAVB, as in "demolition,"
"construction," "accession," etc., indicates that there is a group of important suffixes
beginning with t or s which failed to show up in the operational definition of affixes.
To test this hypothesis, these possible suffixes were subjected to the part-of-speech
tests for affixes with the following results:

    Suffix      POS Implication

    tion        Neutral
    sion*       NAVB
    tial        Neutral
    sial        AJ
    tive        Neutral
    sive        Neutral
    tious       AJ

Examination of the suffix tious led to examination of the weak suffix possibility
ous, which, like tious, turned out to have strongly adjectival implications. Undoubtedly,
these suffixes do exist and have strong part-of-speech connotations. For the sake of
completeness, they have been added to Table 2-3 as Set III.

Whether or not the use of the part-of-speech implications reported in this paper
will be adequate to produce 95 percent accurate part-of-speech by algorithmic
assignment remains to be seen. They are, of course, guaranteed to produce 95 percent
inclusive accuracy on words with listed affixes. It is not yet known how many
nonaffixed words there are, nor how well they fit the general rules. Before
comprehensive testing can take place, it may be necessary to develop more definitive rules
for determining when an affix is acting as an affix in a given word.

3. ON THE INFLECTION OF WRITTEN ENGLISH ADJECTIVES


H. L. Resnikoff and J. L. Dolby

The part-of-speech algorithm under development is predicated on the assumption
that it is possible to determine the parts of speech of English words without the use of
extensive dictionaries. But it is by no means evident that the eight traditional
parts-of-speech classes* are meaningful reflections of the structural properties of the
English language, and it must be supposed that they have relevance to English only
insofar as English bears a genetic relationship to Latin. However, the two languages
are vastly different in important respects, and there is, therefore, no real reason to
believe that the Latin norms are meaningful in the description of English.

The traditional definitions of the English parts of speech do not help to allay the
suspicion that the parts-of-speech classes are the product of the desire of the early
English grammarians to fit English to the Latin mold. Gleason (Ref. 1)** has written,

    English grammar is traditionally described in terms of eight
    parts of speech .... These eight classes are of quite diverse
    character and validity. The familiar definitions overlap and
    conflict, or are so vague as to be nearly inapplicable. Some
    parts of speech gather together a number of not very obviously
    related types of words. In other cases, the line of demarcation
    between parts of speech is rather arbitrary.

These views contrast sharply with the basic premise of the Indexing Project. The
Project is attempting to index texts by using a sentence dictionary, that is, a collection
of the distinct parts-of-speech sequences occurring in English sentences, based on the
traditional parts-of-speech classifications with only minor modifications. If, indeed,
these classes are meaningless, or if the assignment of English words to these classes
*Noun, pronoun, adjective, verb, adverb, preposition, conjunction, and interjection.
**Page 92.

is capricious, then it is not possible for the sentence dictionary to have much utility in
the solution of the indexing problem.

For this reason it is important to show that the traditional parts-of-speech classes
do correspond closely to structural properties of English words. In fact, if a close
correspondence can be discovered, then it can be used to provide a structural definition
of the parts-of-speech classes, and this will have the virtue of essential agreement
with existing sources of data, e.g., dictionaries.

There are several distinct ways of illustrating structural properties of
parts-of-speech classes. One way is to construct an algorithm that will generate the
parts-of-speech class of a given word from the graphemic shape of the word (together with
certain other structural information which is independent of the particular word under
examination, and without the use of comprehensive dictionaries). It is not yet known
to what extent this is possible, although certain progress has been made. For instance,
the multivowel-string words ending with a are very uniformly nouns. The authors (Ref. 2)*
showed that the set of one-vowel-string words depleted by the "structure words" and
the -le suffixed words form a part-of-speech category: that is, almost all such words
belong to the category noun-adjective-verb. Results reported in the first annual
report (Ref. 3) show that it is possible to construct a reasonably straightforward algorithm
which will correctly determine the parts-of-speech class of a random sample drawn
from a dictionary with an accuracy of between 70 and 80 percent on the standard words.
This is not very good in terms of an algorithm that can be used reliably as a component
in a functioning, utilitarian, English text processing system. However, it is strong
evidence that the traditional parts-of-speech classifications must indeed bear a close
relationship to the structural properties of English.

*Section 7 and footnote 22.

It might be true that the algorithm reflects the structured assignment of parts of
speech but that parts of speech have nothing, or at best little, to do with the structure
of English. In other words, it might happen (although the authors believe this to be
farfetched) that the traditional classification is orderly but that the order is one
imposed by the early grammarians in some complicated way not really related to the
direct properties of the language. If this possibility is admitted, it becomes of interest
to find some relationship between the parts-of-speech assignments and some clearly
significant structural property of English. In this paper we will describe such a
relation.

The traditional grammarian, George Curme, distinguishes two types of comparison,
i.e., inflection,* of adjectives in English (Ref. 4):

    There are two quite different types of inflection employed
    in comparing English adjectives - the terminational and
    the analytic.

    1. Termination type of comparison. In this type we add
    to the positive -er to form the comparative and -est to
    form the superlative: strong, stronger, strongest. This
    way of comparing adjectives was universal in Old English,
    but it is now confined to words of one syllable and a large
    number of words of two syllables, especially those in
    -er, -le, -y, -ow, -some. . .

    2. Analytic type of comparison. Here we put more
    before the comparative and most before the superlative:
    beautiful, more beautiful, most beautiful. Adjectives and
    participles with more than two syllables regularly follow
    this type, also many words with two syllables. ...**

Gleason defines adjectives as those words which are inflected using the terminational
type of comparison described by Curme; words occurring in the environments in which
adjectives are found, but which compare using the analytic type of comparison, he calls
adjectivals.*** Both types will be referred to as adjectives in this paper.


*See Reference 4 for an extensive discussion of English verb inflection.

**Page 220, 104.B.

***Pages 92-93. There are also a small number of irregularly inflected adjectives.

In Curme's description quoted above, the number of syllables contained in the
adjective under examination is important in determining to which type of inflectional
paradigm it belongs. In the study of written English the notion of syllable, which is
phonological, is not present. It must be replaced by the number of admissible vowel
strings contained in the word, according to the method developed in Reference 2. For
the present it will be enough if we approximate to that definition by counting the final
e in a word as a consonant, and then counting the number of remaining vowel strings
(i.e., the number of connected sequences of vowels) in the word. Then Curme's
description states that terminational comparison of adjectives is reserved primarily
for one-vowel-string words and certain two-vowel-string words containing selected
suffixes, whereas analytic comparison occurs for the remaining adjectives.
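
The approximation described above is easy to state in code. The following Python sketch counts a final e as a consonant and then counts the connected sequences of vowels. Treating only a, e, i, o, and u as vowels (so that y counts as a consonant) is an assumption made here; the full definition of admissible vowel strings is in Reference 2 and is not reproduced in this paper. The function name is hypothetical.

```python
import re

# Assumption: only a, e, i, o, u are vowels; y is treated as a consonant.
VOWEL_RUN = re.compile(r"[aeiou]+")


def count_vowel_strings(word):
    """Approximate vowel-string count: final e is a consonant, then count vowel runs."""
    w = word.lower()
    if w.endswith("e"):
        w = w[:-1]  # the final e is counted as a consonant
    return len(VOWEL_RUN.findall(w))


for w in ("strong", "stronger", "beautiful", "preface"):
    print(w, count_vowel_strings(w))
```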

Of particular interest are the one-vowel-string adjectives. Contrary to Curme's
description, there are large numbers of one-vowel-string adjectives which inflect
analytically. It has already been remarked that most one-vowel-string words are
noun-adjective-verbs, and, hence, in particular they are adjectives. Almost any one
of these words provides an illustration of analytic comparison for one-vowel-string
words. Thus:

    charm, bloat, squint, ring, bound, flash, etc.

That these words are compared analytically is not due to any hypothetical inability
to carry the comparative terminational suffixes. Each of the words given in
illustration has a corresponding noun form with the suffix -er appended, but in no case is this
form the comparative of the word. Thus it would appear that Curme's description does
not agree with the facts in any significant way, although his description is traditional.

The traditional description of comparison for one-vowel-string words is in general
disagreement with the facts. Nevertheless, it does contain a hidden kernel of truth
which leads to a rather startling structural relationship between certain classes of
words. It must be exactly this relationship to which the inadequately phrased
traditional description is attempting to draw attention.

Suppose that the set of one-vowel-string words which do not end with the sequence
consonant + le is denoted by W.* If a word has a standard usage as a traditional part
of speech in Merriam-Webster's New International Dictionary, third edition,
hereinafter abbreviated "MW3," then it will be called a standard noun, standard adjective,
standard verb, etc.

CLAIM: (1) The standard adjectives in W which are not standard adverbs
           are inflected analytically, i.e., by using more and most.

       (2) The standard adjectives in W which are also standard adverbs
           are inflected terminationally, i.e., by using the suffixes -er
           and -est.

In the following paragraphs we will substantiate this Claim. First, some remarks
are in order as to the meaning of the Claim, if indeed it is true. In view of the
discussion of the relation of the traditional parts-of-speech classes to structural properties
of English, the assertion takes on a special importance. The assertion is that the set
of adjectives of a certain graphemically defined type (namely, those that belong to W)
can be partitioned into two classes - one containing the analytically inflected adjectives
and the other containing the terminationally inflected adjectives - and that this partition
can be determined solely from the knowledge of the parts-of-speech classes to which
the adjectives belong. Thus, a direct relationship between the traditional
parts-of-speech classes and an easily observed structural property is asserted. This lends
weight to the traditional classification in a very impressive way.
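
Read as a decision rule, the Claim says that for a word in W the type of comparison can be predicted from its part-of-speech set alone. A minimal Python sketch, assuming the tape abbreviations AJ and AV for adjective and adverb and a hypothetical function name, is given here. It encodes the Claim itself, not an independently verified fact about any particular word, and it takes no account of exceptions to the Claim.

```python
def predicted_comparison(pos):
    """Predict the comparison type of a word in W from its standard parts of speech.

    Returns None if the word is not a standard adjective.
    """
    if "AJ" not in pos:
        return None
    # Part (2): adjectives that are also adverbs take -er and -est.
    # Part (1): all other adjectives take more and most.
    return "terminational" if "AV" in pos else "analytic"
```

For a word with the set noun-adjective-verb, such as charm in the illustration above, the rule returns "analytic"; for a word whose set includes both adjective and adverb it returns "terminational".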

*A  more  accurate  restriction  is  this:  W  denotes  the  set  of  elementary  words,  as 
defined  in  Reference  2.  In  particular,  almost  all  of  the  elementary  words  are  one- 
syllable  words  in  our  dialects,  and  conversely. 

The Claim must be generously interpreted. It would be false to assert that it has
no exceptions; what is really meant is that the proportion of exceptions, and even the
particular properties of the exceptions, show them to constitute a maverick and rare
set of words, which either belong to the nucleus of words with so many meanings or
such frequent usage that it is almost impossible to modify or destroy them, or
belong to the fringes of the current language and can be expected to fade out with
time.

Current English is in a state of rapid change. Many people object to many of the
changes which, they contend, debase the language. In particular, there has been
increasing use, if not acceptance, of such phrases as drive slow, run quick, fresh cut,
etc. The words slow, quick, fresh, etc., as adjectives, have the terminational
comparison:

slower, slowest, quicker, quickest, fresher, freshest, etc.

According to the Claim, the words should also be adverbs. If the Claim represents a
productive property of English, then such words as slow, quick, fresh, etc., must
either lose the terminational inflection as adjectives, or take on the part-of-speech
adverb in addition to their other parts of speech. Evidently the latter is just what
occurs. But, in reality, these words are not assuming adverbial usage as a current
novelty; each of them has adverb meanings in older unabridged dictionaries such as the
Merriam-Webster 2nd edition.

We will now turn to the data. To check the hypothesis that the inflection of
one-vowel-string adjectives is a function of the adverb part-of-speech class, a random
sample was drawn from English Word Speculum, Vol. I.*

In a sample of 11,200 words, randomly distributed, there were 111 one-vowel-string
words which had at least the parts-of-speech noun-adjective-verb and were
standard with respect to each of these classes. Since 111 is about 0.99 percent of
11,200, and since the Speculum I contains about 75,000 words, one can expect to find
about 750 words with these properties in a medium-size dictionary.

Of  the  111  words,  95  had  no  adverbial  usage,  13  were  standard  adverbs,  and  3 
were  nonstandard  adverbs.  Thus,  about  12  percent  of  the  111  words  were  standard 
noun-adjective-verb-adverbs,  and  one  would  expect  to  find  a  total  of  about  90  such  in 
a  medium -size  dictionary. 

Of the 95 words which did not have any adverbial usage, only 2 inflected the
adjectival form using terminational inflection, i.e., about 2 percent. This supports
the first part of the Claim, that standard adjectives which are not standard adverbs
are inflected analytically.
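The estimates in the preceding paragraphs are proportional scalings from the sample to the full word list. As a rough check, the arithmetic can be sketched as follows. The counts are those quoted in the text, the 75,000-word size is the text's round figure, and the helper names `percent` and `scale_up` are illustrative only.

```python
# Counts quoted in the text for the random sample from Speculum I.
SAMPLE_SIZE = 11_200
DICTIONARY_SIZE = 75_000        # the text's round figure for Speculum I
NAV_WORDS = 111                 # one-vowel-string noun-adjective-verb words
STANDARD_ADVERBS = 13           # of those, also standard adverbs
NO_ADVERB = 95                  # of those, no adverbial usage at all
NO_ADVERB_TERMINATIONAL = 2     # non-adverbs that still take -er, -est


def percent(part, whole):
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


def scale_up(count, sample=SAMPLE_SIZE, population=DICTIONARY_SIZE):
    """Scale a sample count to the size of the whole dictionary."""
    return count * population / sample


print(f"share of sample:             {percent(NAV_WORDS, SAMPLE_SIZE):.2f}%")
print(f"expected in dictionary:      {scale_up(NAV_WORDS):.0f}")
print(f"standard adverbs among them: {percent(STANDARD_ADVERBS, NAV_WORDS):.0f}%")
print(f"expected in dictionary:      {scale_up(STANDARD_ADVERBS):.0f}")
print(f"terminational non-adverbs:   {percent(NO_ADVERB_TERMINATIONAL, NO_ADVERB):.1f}%")
```

The scaled counts come out near 743 and 87, which the text rounds to about 750 and about 90.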

Of the 3 words that had nonstandard adverbial usage, 2 had obsolete adverbial
usage, and 1 had dialectal adverbial usage. The obsolete words follow the analytic
inflection, while the dialectal word follows the terminational inflection. This is not
surprising, first because the obsolete forms may be already discarded from the current
language, and second because dialectal forms may be quite contemporary and popular,
and thus reflect the productive forms of the language. In this instance, the dialectal
adverb was black, with inflection blacker, blackest.

*Reference 6 is the English Word Speculum, whose several volumes are referred to
as Speculum I, Speculum II, etc. Speculum I contains more than 73,000 distinct
words [the word list of the Shorter Oxford Dictionary (SOX)] together with
part-of-speech and status classes from both the SOX and the MW3, ordered in a
statistically random fashion. Speculum II contains an extracted word list from
Speculum I together with parts-of-speech and status information, organized so that all
words with a fixed number of vowel strings are brought together, and within each of
these classes, the words are forward alphabetized.

Of the 13 standard adverbs in the collection, 5 inflect analytically. This is about
38 percent of the total, and does not verify the second part of the Claim in any
significant way. But a sample of 13 words is too small to have any statistical
significance. Furthermore, in attempting to analyze the adverb-adjectives that inflect
analytically we encounter a lexicographical problem which may prove to be decisive for
the limited collection of words which must be examined. Dictionaries typically indicate
the terminational inflections of adjectives explicitly; when a terminational inflection
is indicated for an adjective, we may be quite certain that it does in fact exist in
text samples. However, if a terminational inflection is not explicitly indicated, this
may be due to one of several causes: the adjective is inflected analytically; the
lexicographer did not work enough on the particular word; or there were a number of
terminational inflections for the adjective that appeared in the corpus, but this
number was small and therefore discounted. In the last case, one must worry about the
smallness relative to the usage of the word, which presents further complications.
Therefore, in general, one can be confident of the information explicitly given in
dictionaries, but must be wary of information which can only be inferred from the
absence of explicit statements. For example, we cannot be certain that the standard
adjective-adverb dang does not have the inflection danger, dangest, although these
forms are not attested in MW3. But the comparative form is quite unlikely, both because
it coincides with a more common word with very different meaning, and also because it
is difficult to assign a comparative to a word such as dang for semantic reasons,
although the superlative presents neither of these problems.

This last example illustrates yet another difficulty associated with the
determination of the analytically inflecting adjectives. There are certain adjectives
which do not occur in the comparative or superlative. For these adjectives, the absence
of explicit information about terminational inflections does not necessarily imply the
existence of analytic inflections; it may well be that these adjectives cannot support
inflected forms for semantic or other reasons, or it may simply be that their
frequency of usage is so low that the inflected forms have not yet been observed. The
latter is probably true of dang, while the former seems to be a reasonable explanation
for the lack of terminational inflection for the adjective-adverb last; for the analytic
forms more last and most last do not appear likely.

The 13 standard adverbs are listed in two columns. The left-hand column contains
those words with terminational inflection of the adjective; the right-hand column
contains those for which the inflections are not terminational.*

stiff     pat
near      dang
keen      south
light     last
dear      snap
cool
fine
dry

In  the  right-hand  column,  the  words  dang  and  last  have  already  been  discussed; 
the  geographical  directions  north,  south,  east,  and  west,  all  are  exceptions  to  the 
Claim  (as  are  the  ordinal  numbers).  It  may  be  that  terminational  forms  of  pat  remain 
to  be  uncovered.  If  all  these  factors  are  taken  into  account,  the  second  part  of  the 
Claim  may  not  be  in  great  difficulty  after  all.  But  the  sample  is  much  too  small  to  be 
of  guidance. 

*Once again we warn the reader that this does not imply that there are analytic
inflections for these words; there may be no observed inflections whatsoever.

To study the second part of the Claim, we must have a larger collection of
adjective-adverbs belonging to the set W. To this end the standard one-vowel-string
noun-adjective-verb-adverb words not ending with consonant -le have been collected
from Speculum II. These words, 97 in number, are listed in Table 3-1. Also listed
in Table 3-1 are the 5 words of this category that did end in consonant -le, to give
some indication of just what we are omitting from our collection.

Of the 97 words in the collection, only 60 use the terminational inflection of
adjectives; 37 have no such indication in the Merriam-Webster 2nd edition (these two
classes are given explicitly in Table 3-1). This represents only about 60 percent
agreement with the second part of the Claim, reflecting almost exactly the proportion
indicated by the small listed sample of 13 standard adverbs. But now that this
substantially larger and complete collection is available, it will be possible to
analyze it in a more detailed fashion.

We have partitioned the set of 37 nonterminational words into two parts: the set
of words which are standard adjective-adverbs in both the SOX and the MW3, and the
set of words for which one of these sources indicates a nonstandard adjective or adverb
usage. Table 3-2 shows this classification. The notation following the words in the
second part indicates the nonstandard usage according to the following conventions:
the letters s, c, r, d, and o occurring inside of parentheses refer to standard,
colloquial, rare, dialectal, and obsolete usage, respectively. The four positions
within the parentheses refer, reading from left to right, to noun, adjective, verb, and
adverb usage, respectively. A period (.) in one of the positions indicates that the
corresponding usage is not given in the source under consideration. Each parenthesis
is followed by either the letter x, denoting SOX, or the letter w, denoting MW3.
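The status notation just described is compact enough to decode mechanically. The following is a minimal sketch, not part of the original study; the function name `parse_status` and the dictionary layout it returns are assumptions made for illustration.

```python
import re

# Letter codes and positions as defined in the text.
STATUS = {"s": "standard", "c": "colloquial", "r": "rare",
          "d": "dialectal", "o": "obsolete", ".": None}
POSITIONS = ("noun", "adjective", "verb", "adverb")
SOURCES = {"x": "SOX", "w": "MW3"}

# Four status characters in parentheses, followed by the source letter.
_ENTRY = re.compile(r"\(([scrdo.]{4})\)([xw])")


def parse_status(notation):
    """Decode a string such as '(sss.)x and (ccss)w'.

    Returns {source: {part_of_speech: status_or_None}}, where None means
    the usage is not given in that source.
    """
    decoded = {}
    for codes, source in _ENTRY.findall(notation.replace(" ", "")):
        decoded[SOURCES[source]] = {
            pos: STATUS[code] for pos, code in zip(POSITIONS, codes)
        }
    return decoded


if __name__ == "__main__":
    print(parse_status("(ssso)x"))
    print(parse_status("(sss.)x and (ccss)w"))
```

For example, `(ssso)x` decodes as standard noun, adjective, and verb usage with an obsolete adverb usage, all according to the SOX.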

Of the 13 words in the top part of Table 3-2 the word bias contains an inadmissible
vowel string, and really should not appear in the corpus; however, it is the only such
word, and it may be simpler for the reader if it is included along with the other words
at this stage of the argument. It may be worth remarking that it is the only
two-syllable word in the collection (in our dialects).

Table 3-1

STANDARD ONE-VOWEL-STRING NOUN-ADJECTIVE-VERB-ADVERBS
FROM SPECULUM II(a)

Consonant -le Words

double    single    tickle    treble    triple

Nonconsonant -le Words With
Terminational Adjectival Inflection

blind     flat      mean      shrill    spruce
chance    flush     pat       slack     square
clean     foul      prime     sleek     steep
clear     fresh     prompt    slick     stern
close     full      pure      slight    stiff
cold      glib      queer     slow      straight
cool      grave     quiet     small     sweet
dry       just      right     smart     thin
faint     keen      rough     smooth    tough
fair      lax       sharp     snug      trim
fine      loose     sheer     sour      true
firm      low       short     spare     warm

Nonconsonant -le Words Without
Terminational Adjectival Inflection(b)

back      dutch     north     side      squab
bias      east      part      slant     stick
bone      flounce   pi        smash     stump
chock     front     plumb     snap      third
dab       home      rear      snell     west
damn      jam       rush      sole
darn      last      scale     splash
dog       mock      shoal     splay

(a) The various parts of speech of these words are
standard in at least the SOX or MW3 and have no
other parts of speech.

(b) Note that this does not imply the existence of
analytic adjectival inflection.

Table 3-2

NONTERMINATIONALLY INFLECTED WORDS FROM TABLE 3-1

Standard Adjective-Adverbs
in Both SOX and MW3

back      north     squab
bias      part      third
east      plumb     west
home      shoal
last      slant

Nonstandard Adjective-Adverb Usage
in Either SOX or MW3

bone      (sss.)x                 rear      (ssso)x
chock     (s.ss)w                 rush      (s.s.)x
dab       (s.s.)w                 scale     (s.s.)x
damn      (s.s.)x                 side      (sdso)x
darn      (s.s.)x                 smash     (s.s.)
dog       (sss.)x                 snap      (s.s.)x
dutch     (sss.)x and (ccss)w     snell     (sssd)w
flounce   (s.s.)x                 sole      (sss.)x
front     (sss.)x                 splash    (s.ss)x and (sss.)w
jam       (sss.)x and (s.ss)w     splay     (sss.)w
mock      (rss.)x                 stick     (s.s.)x
pi        (sd..)x                 stump     (sss.)w

The bottom part of Table 3-2 shows that there is considerable disagreement
between the SOX and the MW3 with respect to the classification of adverbs, and to a
lesser extent, of adjectives. It is evident that the SOX is much more conservative,
i.e., has a higher frequency threshold for the admission of adverbial usage than does
MW3. But it is also evident that the SOX principles are in close accord with the
second part of our Claim.

If we agree that dictionaries are most reliable when several of them agree, then
we will be urged to discard the words in the bottom part of Table 3-2 when
examining the agreement with the Claim. If this is done,* then there remain 73 words
in the collection (including bias), of which only 13 do not have terminational
inflection. That is, the second part of the Claim is true for 82 percent of the words.
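The two agreement figures follow directly from the word counts given in the text. A short sketch of the calculation is shown here; the variable names are illustrative, and the count of 24 discarded words is derived as 37 minus 13 rather than stated explicitly in the text.

```python
TOTAL_WORDS = 97            # collection drawn from Speculum II
TERMINATIONAL = 60          # words with terminational adjectival inflection
NONTERMINATIONAL = TOTAL_WORDS - TERMINATIONAL     # 37 words
STANDARD_IN_BOTH = 13       # nonterminational, standard in both SOX and MW3
DISCARDED = NONTERMINATIONAL - STANDARD_IN_BOTH    # words the sources disagree on


def agreement_percent(agreeing, total):
    """Percentage of words that agree with the second part of the Claim."""
    return 100.0 * agreeing / total


before = agreement_percent(TERMINATIONAL, TOTAL_WORDS)
after = agreement_percent(TERMINATIONAL, TOTAL_WORDS - DISCARDED)
print(f"before discarding: {before:.0f} percent of {TOTAL_WORDS} words")
print(f"after discarding:  {after:.0f} percent of {TOTAL_WORDS - DISCARDED} words")
```

This gives roughly 62 percent for the full collection (the text's "about 60 percent") and 82 percent for the 73 words that remain.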

An examination of these 13 words which do not agree with the second part of the
Claim is fruitful. These are the words in the top part of Table 3-2. The three
geographic directions east, north, and west appear,** and the ordinal third also
appears. It is evidently impossible for ordinals to have comparative or superlative
inflections for semantic reasons: the most third is no better than the thirdest.
Similar remarks apply, but based on more personal evaluations, for the words home and
last, and perhaps for some of the others as well.

Thus it may be that, after semantic considerations have been accounted for, the
agreement with the second part of the Claim will be in the 90 percent range. Due to
the difficulties inherent in obtaining adequate and complete information to test the
Claim, such a level of agreement would be impressive. For the present, however,
we will have to satisfy ourselves with the weaker 82 percent agreement.

The relationship between terminationally inflected adjectives and adverbs can be
used in the determination of the parts of speech of certain two-vowel-string words.
Both the comparative and the superlative of such adjectives are two-vowel-string words
(because the words discussed above were all, with the exception of the excludable
word bias, one-vowel-string words). But the comparative suffix, -er, coincides with
a suffix with a quite different structural role, and therefore can be confused with
the latter. It is clear that the comparatives of adjectives are themselves adjectives,
and dictionaries often take this fact for granted and do not explicitly indicate that a
given word is the comparative form of an adjective. For example, the words cooler
and fuller are both listed as nouns but not as adjectives in both SOX and MW3. Clearly,
the dictionary user is supposed to recognize that these are, in addition to noun (and
perhaps still other) usage words, comparatives of adjectives. This being the case, it
is necessary for a parts-of-speech-predicting algorithm to distinguish those -er forms
which are not comparatives from those that are.

*We really should eliminate those words in Table 3-1 which do have terminational
adjectival inflection but are not standard adjectives and adverbs in both SOX and MW3;
but we have not actually done this. It seems that the results would not be much
different, although the expenditure of effort would be considerable.

**The reader will recall that Speculum II contains only those words whose parts of
speech are included among noun, adjective, verb, and adverb. The word south has
other parts of speech, and therefore does not appear in Table 3-1 or Table 3-2; the
same is true of dang, which appeared in the random sample discussed.

This can be achieved in the following way. Only those one-vowel-string words
that are adverbs as well as adjectives compare using terminational inflection; we will
assume, in agreement with the second part of the Claim, that all such words do
compare in this way. Then a two-vowel-string word ending with -er can be expected to be
the comparative of an adjective, say A, if the word is of the form Aer and if A is
both an adjective and an adverb. As we have seen, the collection of all
one-vowel-string adjective-adverbs is not large;* hence, these can be stored in a
dictionary in a parts-of-speech-predicting algorithm.

In illustration, consider the forms cooler and fuller discussed above. They are
of the form Aer with A standing for cool and full respectively, both of which are
adjective-adverbs in the one-vowel-string word class. Hence, both cooler and fuller
are comparatives of adjectives (hence are adjectives) in addition to any other
parts-of-speech properties they may have.
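The test for comparatives can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the procedure as implemented in the study: the stored word set below is a small hand-picked subset of the adjective-adverbs named in this section, the function name `comparative_base` is hypothetical, and the only spelling adjustment handled is the change of y to i (as in dry, drier). Other spelling changes a full procedure would need are not covered.

```python
# Illustrative subset of one-vowel-string adjective-adverbs; the real
# stored dictionary would hold the full collection.
ADJECTIVE_ADVERBS = {"cool", "full", "dry", "slow", "fresh"}


def comparative_base(word, lexicon=ADJECTIVE_ADVERBS):
    """Return the adjective A if word has the form A + 'er', else None.

    A must be in the stored adjective-adverb list. A stem ending in 'i'
    is also tried with 'y' restored.
    """
    word = word.lower()
    if not word.endswith("er"):
        return None
    stem = word[:-2]
    candidates = [stem]
    if stem.endswith("i"):
        candidates.append(stem[:-1] + "y")   # dri -> dry
    for candidate in candidates:
        if candidate in lexicon:
            return candidate
    return None


if __name__ == "__main__":
    for w in ("cooler", "fuller", "drier", "butter", "cool"):
        print(w, "->", comparative_base(w))
```

Here cooler and fuller are recognized as comparatives of cool and full, while an -er word such as butter, whose stem is not a stored adjective-adverb, is left to the other rules.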

*Note that Table 3-1 does not contain all one-vowel-string adjective-adverbs, but
rather only those that are also noun-verbs, and such that all four of these
parts-of-speech categories are standard for the words involved, and such that no other
parts-of-speech classes occur. But these restrictions do not diminish the size of the
class by a large factor due to the fact that the one-vowel-string words essentially
form a parts-of-speech category, namely, noun-adjective-verb-adverb.

In the general application of the procedure just outlined, it will sometimes be
necessary to take into account algorithmic spelling changes. For example, the
adjective-adverb dry compares as drier and driest, the y changing to i. Similar
consistent changes are described in Reference 4, and will not be further discussed
here.

4. AUTOMATIC DETERMINATION OF PARTS OF SPEECH OF ENGLISH WORDS

L. L. Earl

INTRODUCTION

This paper describes the development and details of a procedure for automatically
assigning part-of-speech characteristics to English words, largely from graphemic
considerations. The development of the algorithm began with the observation of Dolby
and Resnikoff (Reference 1) that the parts of speech associated with one-syllable
words are frequently noun (or noun and adjective) and verb, while the parts of speech
associated with multisyllable words are usually noun and adjective only. Development
of a working part-of-speech algorithm required the study of exceptions to this general
rule so that analytical subrules and exception lists sufficient to automatically
identify all such exceptions could be derived. Two avenues for the isolation and study
of exceptions were utilized:

(1) Exhaustive sorts of a 73,582-word dictionary on magnetic tape were used to
separate and classify words consistent with the general rule from those that
were not.

(2) Analysis of possible part-of-speech implications of affixes was carried out,
by computer, on the same dictionary.

The resulting algorithm utilizes a prepared dictionary of less than 800 words
and an affix list of less than 200 entries.

PARTS OF SPEECH USED, AND THEIR ABBREVIATIONS

The tape dictionary used for both analyses contained 73,582 words, with
part-of-speech and word status information from The Shorter Oxford Dictionary and the
Merriam-Webster New International Dictionary (Reference 2). The tape dictionary is
reliable in most respects, since it was made from punched cards transcribed directly
from the dictionaries, verified by different personnel, and spot checked periodically
during the process. Nevertheless, errors did occur, particularly in the recording of
part-of-speech information which was not always understood by the keypunchers. The
parts of speech recorded are as follows:

noun        N     adverb        AV    pronoun       PN
adjective   AJ    preposition   PR    interjection  IJ
verb        VB    conjunction   CJ    past verb     PV

In addition, the category "other" (OT) was used whenever the dictionary gave some part
of speech other than the nine listed above. OT comprises mainly participles, numerals,
articles, and collective nouns. The algorithm was designed to assign these same nine
parts of speech (excluding OT) with the addition of four more which were unfortunately
subsumed under OT: present participle (PA), past participle (PP), auxiliary verb (AX),
and plural or collective noun (NP). The category noun was changed to the category
noun-adjective (NA) on the grounds that nearly all nouns can act as adjectives under
some circumstances; therefore, although we will try to distinguish AJ from NA, we
will not try to distinguish N from NA. Collective nouns will be assigned the string NA
and NP to show possible use with either singular or plural verbs (Reference 4).
Although a dictionary may show additional or fewer parts of speech for participial
forms, their use (or lack of use) as nouns, adjectives, or verbs will be considered
here as implicit in the participle assignment, and no attempt will be made to
distinguish them. Thus, present participles will implicitly be possible nouns,
adjectives, or in a verb phrase, and past participles will implicitly be adjectives,
past verbs, or in a verb phrase. An attempt will be made to identify participles which
have any other special usages, and to identify irregular past tense and past
participial forms.

DESIGN PLAN

In the design of a part-of-speech algorithm, a goal of 95 percent accuracy was set.
To begin with, three basic rules were postulated:

Rule A: The part-of-speech string associated with a word containing only one
        vowel string in its kernel will be NA - VB, where a kernel will be defined
        as a word stripped of its affixes. Similarly, the part-of-speech string
        associated with words with multivowel-string kernels will be NA.

Rule B: The part-of-speech string associated with a word ending in ed will be PP,
        and with a word ending in ing will be PA. All PP will also be considered
        PV. A NA classification will be changed to NP for all words ending in
        single s.

Rule C: The part-of-speech string associated with a word ending in ly will be
        AJ - AV.

Rule A is basically a refinement of the original Dolby-Resnikoff hypothesis and depends
on the Dolby-Resnikoff definition of a legal vowel string. It also depends on the
existence of an operational definition of affixes (Reference 6). Rules B and C are a
recognition of the most consistently used and meaningful suffixes of English.
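As a rough illustration of how Rules A, B, and C combine before any exception lists are applied, consider the sketch below. Several simplifications are assumptions made here and are not from the paper: a vowel string is taken to be any maximal run of the letters a, e, i, o, u, y (the Dolby-Resnikoff definition of a legal vowel string is not reproduced), no affixes are stripped (the whole word is treated as its kernel), any final s is treated as the "single s" of Rule B, and the order in which the rules are tried is a guess. The function names are hypothetical.

```python
import re

# Simplifying assumption: a vowel string is any maximal run of these letters.
_VOWEL_RUN = re.compile(r"[aeiouy]+")


def vowel_string_count(kernel):
    """Count vowel strings in a kernel under the simplified definition."""
    return len(_VOWEL_RUN.findall(kernel.lower()))


def baseline_pos(word):
    """Apply Rules C, B, and A (in that assumed order) to a single word."""
    w = word.lower()
    if w.endswith("ly"):                  # Rule C
        return ["AJ", "AV"]
    if w.endswith("ed"):                  # Rule B: every PP is also PV
        return ["PP", "PV"]
    if w.endswith("ing"):                 # Rule B
        return ["PA"]
    # Rule A, with the whole word standing in for the kernel.
    tags = ["NA", "VB"] if vowel_string_count(w) == 1 else ["NA"]
    if w.endswith("s"):                   # Rule B: NA becomes NP
        tags[0] = "NP"
    return tags


if __name__ == "__main__":
    for w in ("cool", "dictionary", "quickly", "walked", "words", "bring"):
        print(w, baseline_pos(w))
```

The last example shows why exception lists are needed: bring ends in ing and is therefore labeled PA by Rule B, which is the kind of case Task 1 tabulates.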

Design of the algorithm was conceived of as requiring three steps:

Task 1: Tabulation of the exceptions to Rules B and C.

Task 2: Tabulation of special-purpose words, with part-of-speech PR, CJ, PN,
        or IJ, which are not covered by Rules A, B, or C.

Task 3: Modification of Rule A as much as necessary to achieve 95 percent
        accuracy, using a study of affixes, or tabulation of exceptions, or both,
        as a means to this end.

The first two tasks will be discussed first, and then the considerably more involved
Task 3 will be summarized. The first two tasks could be accomplished by sorting the
dictionary on magnetic tape, as mentioned in the introduction, although it may be of
interest that not all of the data handling necessary could be accomplished with a
generalized sort routine. 7094 SORT was used, but special-purpose routines were
also developed.

DICTIONARY  STUDIES 

Task  1:  Exceptions  to  Rules  B  and  C 

For Tasks 1 and 2 the tape dictionary entries were divided into 2 categories,
those with parts of speech (POS) limited to NA, AJ, VB or AV and those having at
least one part of speech other than NA, AJ, VB or AV. To find the exceptions to
Rule B, the entries in the second category were separated into two lists.

List 1: Words ending in ed, ing, or single s.

List 2: Words not ending in ed, ing, or single s.
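The division into categories and lists amounts to a simple filter over the dictionary entries. The sketch below is an illustration only: the original work used tape sorts on an IBM 7094, not this code, and the function names, the in-memory dictionary format, and the sample tags are all hypothetical. Any final s is treated as a match, since the handling of double s is not specified in the text.

```python
BASIC_POS = {"NA", "AJ", "VB", "AV"}


def category(tags):
    """Category 1: only NA, AJ, VB, AV. Category 2: anything else present."""
    return 1 if set(tags) <= BASIC_POS else 2


def has_rule_b_ending(word):
    """True for words ending in ed, ing, or s (assumed reading of 'single s')."""
    return word.lower().endswith(("ed", "ing", "s"))


def split_category_two(entries):
    """Split category 2 entries into List 1 and List 2 by their ending."""
    list1, list2 = [], []
    for word, tags in entries.items():
        if category(tags) == 2:
            (list1 if has_rule_b_ending(word) else list2).append(word)
    return list1, list2


def rule_b_exceptions(entries):
    """List 1 words lacking OT, and List 2 words carrying OT."""
    list1, list2 = split_category_two(entries)
    from_list1 = [w for w in list1 if "OT" not in entries[w]]
    from_list2 = [w for w in list2 if "OT" in entries[w]]
    return from_list1, from_list2


if __name__ == "__main__":
    # Hypothetical entries; the tags are illustrative, not read from the tape.
    sample = {
        "during": ["PR"],
        "walked": ["OT"],
        "beaten": ["OT"],
        "about": ["AV", "PR"],
        "cool": ["NA", "AJ", "VB", "AV"],
    }
    print(split_category_two(sample))
    print(rule_b_exceptions(sample))
```

With these sample entries, during is an exception arising from List 1 (it ends in ing but is not OT) and beaten is an exception arising from List 2 (it is OT without a Rule B ending).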

According  to  Rule  B,  all  words  in  List  1  should  be  categorized  as  OT  and  all  those  in 
List  2  should  not  be.  Exceptions  to  Rule  B  arising  from  List  1  are  in  Tabic  4-1  and 
those  arising  from  List  2  are  in  Table  4-2.  Only  words  ir.  standard  usage  arc  shown 
in  any  of  the  tables.  There  wore  only  18  words  in  the  exceptions  arising  from  List  1, 
and  these  are  all  shown  in  Table  4-1.  This  list  of  18  words  does  not  comprise  all  the 
words  ending  in  ed,  tag,  or  t  which  are  not  categorised  as  OT,  as  there  are  many 
more  of  these  in  the  NA,  AJ,  VB  category,  alto.  Fortunately  moat  such  category  1 
words  need  not  be  considered.  Words  ending  in  tag  need  not  be  considered  because 
their  actual  parts  of  speech  (usually  NA.  as  for  pudding)  are  subsumed  under  the 
participle  heading;  classifying  them  as  present  partiepies  will  be  correct  from  the 

I 
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Table  4-1 


EXCEPTION  WORDS  ENDING  IN  b,  ed,  OR  injr 


From  Category  I 


aliped 

NA 

atlas 

NA 

VB 

biped 

NA 

bonus 

NA 

VB 

callus 

NA 

VB 

canvas 

NA 

VB 

caucus 

NA 

VB 

census 

NA 

VB 

childbed 

NA 

VB 

chorus 

NA 

VB 

circus 

NA 

VB 

debarras 

VB 

debus 

VB 

disfoliaged 

NA 

embarras 

VB 

embed 

NA 

VB 

embus 

VB 

exceed 

VB 

fissiped 

NA 

focus 

NA 

VB 

gossipied 

NA 

hocus 

NA 

VB 

hotbed 

NA 

hundred 

NA 

interbed 

NA 

VB 

lobsided 

NA 

millipod 

NA 

misdeed 

NA 

mohammed 

NA 

monied 

NA 

NA 

(tat  mi|ied 

NA 

pinniped 

NA 

quadruped 

NA 

rebut 

NA 

VB 

sacred 

AJ 

*oliped 

NA 

thoroughbred 

NA 

vartabed 

NA 

watershed 

NA 

worsted 

NA 

From  Category  II 


across 

AJ  AV  PR 

alas 

IJ 

anything 

N(A)  AV  PN 

besides 

CJ  PN 

bring 

N(A)  VB  U 

cross 

NA  VB  AV  PR 

during 

PR 

hoicks 

VB  IJ 

minus 

NA  AV  PR 

nothing 

N(A)  AV  PN 

plus 

NA  VB  AV  PR 

something 

NA  VB  PN  AV 

theirs 

PN 

this 

N  VB  AJ  PN 

unless 

N|A)  PR  CJ 

various 

AJ  PN 

whereat 

N(A)  CJ 

whing 

N(A)  VB  IJ 

Table  4-2 

EXCEPTION  WORDS  DERIVED  FROM  LIST  II 


Irregular  Participle  and  Part  Tense  Verbs 


bet 

NA 

VD 

pp 

drew 

PV 

PP 

beaten 

PP 

drunken 

PP 

begotten 

PV 

PP 

driven 

PP 

bidden 

PV 

PP 

felt 

NA 

VB 

PV 

PP 

bitten 

PP 

flown 

ABSENT 

blown 

NA 

PP 

flew 

NA 

PP 

blent 

PP 

fought 

PP 

blest 

PP 

fraught 

AJ 

PP 

bound 

NA 

VB 

pp 

frozen 

PP 

Irate 

NA 

VB 

pp 

gilt 

NA 

PP 

b.trne 

AJ 

PV 

pp 

given 

PP 

born 

AJ 

PV 

pp 

gone 

NA 

PP 

bought 

AJ 

VB 

pp 

got 

PP 

bounck  2 

PP 

ground 

NA 

VB 

PV 

PP 

broke 

NA 

VB 

pp 

grit 

NA 

VP 

PP 

brought 

PV 

PP 

grew 

PV 

PP 

brant 

NA 

PP 

graven 

PV 

PP 

brafe  ten 

PP 

had 

PV 

PP 

bracken 

NA 

PP 

held 

PP 

broJien 

NA 

PP 

hewn 

PP 

buih 

NA 

VB 

pp 

hidden 

PP 

burst 

NA 

VB 

PV  PP 

hung 

VB 

l»V 

PP 

come 

NA 

VB 

pp 

knit 

NA 

Vti 

PP 

caught 

AJ 

VB 

PV  PP 

known 

NA 

PP 

chosen 

NA 

PP 

lay 

NA 

VB 

PP 

clad 

AJ 

VB 

pp 

bt 

NA 

Vll 

PP 

clove 

PV 

PP 

left 

NA 

VB 

PP 

clung 

PP 

lent 

PP 

cleft 

NA  PV  PP 

mode 

AJ 

PP 

cloven 

PV 

PP 

met 

NA 

PP 

could 

AX 

meant 

PP 

crept 

PP 

might 

NA 

PV 

PP 

cut 

NA 

VB 

pp 

mtsgotten 

i*v 

PP 

did 

PV 

PP 

mown 

PV 

PP 

done 

NA 

PV 

pp 

molten 

pp 

drove 

NA 

py 

pp 

ought 

NA 

VB 

AV 

PN  PP 

drunk 

NA 

pp 

paid 

AJ 

PP 

drawn 

PP 
Irregular Participle and Past Tense Verbs

pent        NA PP             split       NA VB PP
put         NA VB PV PP       spent       VB PP
quit        NA VB PP          spoken      PP
rang        VP                stole       NA VB PV PP
read        NA VB VP PP       strung      PP
reft        NA PP             stung       PP
rent        NA VB PP          stricken    PV PP
rung        NA PP             stolen      NA PP
run         NA VB PP          sung        PP
said        AJ PP             sunk        AJ PP
saw         NA VB PP          sunken      PP
sewn        PV PP             swam        PP
sent        NA PP             sworn       PP
should      AJ AX             swollen     PP
shod        AV PP             taught      AJ PV PP
shone       PV PP             thrown      AJ PP
shrunk      PP                thought     PV PP
shook       NA VB PP          threw       PP
shorn       NA PV PP          thrust      NA VB PV PP
shot        NA VB PP          told        PP
shaken      PP                torn        PV PP
shapen      PP                trodden     PP
shotten     PP                went        PP
£*ha. on    PP                were        PP
riven       PP                wet         NA VB AV PP
slunk       NA PP             widen       PV PP
slit        NA VB PP          woke        PV PP
slew        NA VB PP          worn        PP
smelt       NA VB PP          would       NA AX
sought      PP                wound       NA VB PV PP
sottuen     AJ VB PV PP       wove        NA PP
spoke       NA VB PP          woven       PV PP
spread      NA VB PV PP       written     PV PP
sprung      PP                wrought     AJ PP
spun        PP                wrung       PP
Irregular Plural or Collective Nouns

apache      NA NP                 marabou     NA NP
cattle      NA VB NP              maxima      NP
carp        NA VB NP              mice        NP
caribou     NA NP                 milanese    NA NP
Chinook     NP                    men         NP
cherubim    NA NP                 pence       NP
dice        NA VB NP              people      NA VB NP
couple      NA VB NP              perch       NA VB NP
crane       NA VB NP              pike        NA VB NP
Crustacea   NP                    poultry     NA NP
cutlery     NA NP                 regalia     NA NP
data        NP                    rice        NA VB NP
dicta       NP                    roe         NA NP
fish        NA VB NP              secreta     NA NP
foe         NA NP                 seraphim    NA NP
fulcra      NA NP                 sheep       NA VB NP
game        NA VB NP              snipe       NA VB NP
geese       NP                    sperm       NA NP
genera      NP                    spawn       NA VB NP
grouse      NA VB NP              spoor       NA VB NP
help        NA NP IJ              squid       NA VB NP
hosiery     NA NP                 steer       NA VB NP
ice         NA VB NP              strata      NP
ingesta     NP                    starfish    NA NP
Irish       NA AJ NP              swine       NP
Japanese    NA NP                 tripe       NA NP
lice        NP                    tuna        NA NP
like        NA VB AV PR CJ NP     viscera     NP
lynx        NA NP                 young       NA NP
point of view of an "inclusive" part of speech. By an "inclusive" part-of-speech
string is meant that string which is sure to contain all the parts of speech attributed
to the word by either dictionary, but which may also contain one more or, rarely,
two more parts of speech. Since use of inclusive part of speech becomes necessary
in Task 3, its justification will be discussed when Task 3 is taken up. Words ending
in ed which are not OT but are either AJ or VP will similarly be correct from an
inclusive part-of-speech viewpoint. However, some non-past-participles ending
in ed are NA. Some of these can be identified by the use of suffixes, to be discussed
later. All others are given in Table 4-1. Most words ending in single s will have the
correct inclusive part of speech assigned by the Rule B - Rule A combination; all
exceptions are also given in Table 4-1. Table 4-1 thus contains all the necessary
exception words ending in s, ing, or ed.
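
The definition of an "inclusive" part-of-speech string can be stated as a simple set test. The following Python sketch is illustrative only and is not the program used in the study; the function names are hypothetical, and parts of speech are represented as the two-letter codes used in the tables.

```python
def required_parts(*dictionary_entries):
    """Union of the parts of speech given for a word by each dictionary."""
    required = set()
    for entry in dictionary_entries:
        required |= set(entry)
    return required


def is_inclusive(assigned, *dictionary_entries):
    """True when the assigned string contains every dictionary part of speech."""
    return required_parts(*dictionary_entries) <= set(assigned)


def extra_parts(assigned, *dictionary_entries):
    """Parts of speech assigned beyond those found in any dictionary."""
    return set(assigned) - required_parts(*dictionary_entries)
```

For example, an assignment of NA-VB for a word listed as NA in one dictionary and VB in the other is inclusive with no extra parts of speech, whereas an assignment of NA alone for a word listed as NA-VB is not inclusive.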

Table 4-2 shows participles, past tense verbs, and plural or collective nouns
which cannot be recognized from s, ing, or ed endings. It is a subjective list derived
from the 1,380 or so entries in List II which had OT designations. To make automatic
determination of part of speech substantially faster than dictionary lookup, the
exception lists were kept as small as possible. The 1,380 entries in List II with OT
designations include numerals, obscure collective nouns (e.g., herb, scrub), words which
become collective only when s is added (e.g., geriatric), and some errors in judgement
by the keypuncher as well. It is believed that this list can safely be reduced to the
words shown in Table 4-2 without dropping below the goal of 95 percent accuracy. All
of the irregular participles and past tense verbs have been retained, but only a partial
list of collective nouns has been included.

Exceptions to Rule C were found by extracting from the entire dictionary all words
which, though ending in ly, were not adverbs, or conversely, though not ending in ly,
were adverbs. Contrary to expectations, there were a large number of such words
(slightly over 1,500). Many of these words were judged rare, or rare in the usage in
question (e.g., dog-fly as NA, or dash, pi, rife, smell, thistle, as AV); others could
be predicted by an extension of the affix lists, to be discussed later. In accordance
with the philosophy of maintaining a relatively short exception list without sacrificing
too much accuracy, this list of 1,500 words has been reduced to a list of 357 of the
common words which are exceptions to Rule C, as shown in Table 4-3.
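
The way such an exception list works with an ending rule can be sketched as follows. This Python fragment is a minimal illustration and not the study's program: it assumes that Rule C assigns AV to words ending in ly, the function name is hypothetical, and the four exception entries are copied from Table 4-3 with the codes used there.

```python
# A few entries taken from Table 4-3; the full list has 357 words.
RULE_C_EXCEPTIONS = {
    "costly": ("AJ",),
    "reply": ("N", "V"),
    "seldom": ("AJ", "AV"),
    "often": ("AV",),
}


def rule_c(word):
    """Return a part-of-speech tuple if the exception list or the ly rule decides, else None."""
    word = word.lower()
    if word in RULE_C_EXCEPTIONS:      # the exception list is consulted first
        return RULE_C_EXCEPTIONS[word]
    if word.endswith("ly"):            # assumed form of Rule C
        return ("AV",)
    return None                        # left for the other rules
```

The exception list covers both directions of failure: words such as costly that end in ly but are not adverbs, and words such as often that are adverbs without the ending.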

Task 2: Tabulation of Special-Purpose Words Which Are Not Covered by Rules A, B, or C

For Task 2, List II was again used. To review, List II contains all the words
which

•  Have at least one standard meaning corresponding to a part of speech other
   than NA, VB, AJ, or AV (the parts of speech assigned by Rules A, B, C)

•  Have all "irregular" entries removed (fragments, etc.)

•  Have all words ending in ed, ing, or s removed (the suffixes covered by
   Rule B)

By extracting from List II all words with standard meaning corresponding to a
part of speech PR, CJ, IJ, or PN we should get an exhaustive list of those structural,
special-purpose words which are so important in a mechanized handling of English.

Table 4-4 shows the 249 function words so extracted. Note that Fig. 4-1 lists
the 18 function words ending in s or ing. Because of a difficulty in sorting, certain
OT words (27) which are irregular adverbs and collective nouns are included in this
group, although they should appear in Fig. 4-3 instead. Because of a misunderstanding
by keypunchers in the original creation of the dictionary, some important pronouns were
not so classified in the Merriam-Webster part-of-speech designations and are therefore
missing from the list (I, your, his, we, them, our, us, their, they). The word as has
Table 4-3

COMMON EXCEPTION WORDS TO RULE C

Word                POS           Word                POS

backhand            N AJ V AV     broad               N AJ AV
bare-backed         AJ AV         cheap               N AJ AV
bare-headed         AJ AV         clean               N AJ V AV
between-whiles      N AV          damn                N AJ V AV
co-ally             N             double              N AJ V AV
cock-sure           AJ AV         east                N AJ V AV
counter-clockwise   AJ AV         faint               N AJ V AV
counter-current     N AJ AV       fair                N AJ V AV
criss-cross         N AJ V AV     false               AJ AV
cross-country       N AJ AV       fine                N AJ V AV
cross-grained       AJ AV         flat                N AJ V AV
double-quick        N AJ V AV     flush               N AJ V AV
free-hand           N AJ AV       forte               N AJ AV
god-damn            N AJ V AV     foul                N AJ V AV
half-and-half       N AJ AV       free                AJ V AV
half-way            N AJ AV       fresh               N AJ V AV
happy-go-lucky      N AJ AV       front               N AJ V AV
harum-scarum        N AJ AV       full                N AJ V AV
now-a-days          N AJ AV       hard                N AJ AV
off-hand            AJ AV         hence               AV
oft-times           AV            here                N AJ AV
old-fashioned       N AJ AV       heynne              AV
over-hard           AJ AV         home                N AJ V AV
over-long           AJ AV         ill                 N AJ AV
over-supply         N V           just                N AJ V AV
point-blank         N AJ AV       keen                N AJ V AV
post-haste          N AJ AV       large               N AJ AV
pot-belly           N             last                N AJ V AV
right-handed        AJ AV         late                AJ AV
rough-and-tumble    N AJ AV       lax                 N AJ V AV
second-class        N AJ AV       least               N AJ AV
side-saddle         N V AV        long                N AJ V AV
single-handed       AJ AV         loose               N AJ V AV
sky-high            AJ AV         loud                AJ AV
so-and-so           N AJ AV       low                 N AJ V AV
topsy-turvy         N AJ V AV     maybe               N AJ AV
under-arm           AJ AV         mean                N AJ V AV
up-country          N AJ AV       much                N AJ AV
up-grade            N V AV        needs               AV
up-stream           AJ AV         new                 N AJ AV
up-wind             N AJ V AV     not>e               N AV
ait                 N AJ AV       north               N AJ V AV
back                N AJ V AV     odd                 N AJ AV
bad                 N AJ AV       oft                 AV
blind               N AJ V AV     old                 N AJ AV
Word                POS           Word                POS

part                N AJ V AV     broadcast           N AJ V AV
pat                 N AJ V AV     broadside           N AJ V AV
prompt              N AJ V AV     broadway            N AJ AV
queer               N AJ V AV     complete            AJ V AV
quick               N AJ V AV     costly              AJ
quite               N AV          counter             N AJ V AV
real                N AJ AV       curly               AJ
right               N AJ V AV     direct              N AJ V AV
sic                 N V AV        dirty               AJ V AV
snug                N AJ V AV     doily               N
soon                AJ AV         doubtless           AJ AV
sour                N AJ V AV     earthly             AJ
square              N AJ V AV     even                N AJ V AV
straight            N AJ V AV     ever                AV
thence              AV            farther             AJ V AV
twice               N AJ AV       farthest            AJ AV
west                N AJ V AV     further             AJ V AV
worse               N AJ AV       furthest            AJ AV
wrong               N AJ V AV     galore              N AJ AV
yea                 N AV          gratis              AJ AV
yep                 AV            gully               N V
yes                 N V AV        hearty              N AJ AV
ablaze              AJ AV         heaven              N AJ V AV
adrift              AJ AV         herein              AV
afield              AJ AV         hereof              AV
aground             AJ AV         higher              N AJ AV
ajar                AJ AV         highest             N AJ AV
akin                AJ AV         hilly               AJ
alias               N AV          holly               N AJ
alike               AJ AV         holy                N AJ
alive               AJ AV         imply               [illegible]
almost              AJ AV         indeed              [illegible]
alone               AJ AV         indoor              AJ AV
aloud               AV            indoors             AV
always              AV            jelly               N V
amuk                N AJ AV       July                N
andante             N AJ AV       largo               N AJ AV
apart               AJ V AV       later               N AJ AV
apiece              AV            latest              N AJ AV
aright              AV            lengthways          AV
askew               AJ AV         lento               AJ AV
astray              AJ AV         lesser              AJ AV
away                AJ AV         lily                N AJ
awful               AJ AV         longways            N AV
awhile              AV            lower               N AJ V AV
Word                POS           Word                POS

lowest              N AJ AV       thereat             AV
matchless           AJ AV         thereof             AV
measly              AJ            threefold           AJ AV
merry               N AJ AV       tidy                N AJ V AV
middling            N AJ AV       topside             N AJ AV
midstream           N AV          twofold             N AJ AV
mighty              N AJ AV       upright             N AJ V AV
molly               N             very                N AJ AV
never               AV            vivace              AJ AV
nohow               AV            weary               AJ V AV
noways              AV            wellnigh            AV
offshore            AJ AV         whereat             AV
offside             N AJ AV       wherein             AV
often               AV            whereof             AV
open                N AJ V AV     whereon             AV
outboard            N AJ AV       wily                AJ
outright            AJ AV         abundant            AJ AV
perchance           AV            adagio              N AJ AV
perforce            N AJ AV       aflutter            AJ AV
perhaps             N AV          afterward           N AV
piano               N AJ AV       afterwards          N AV
plenty              N AJ AV       aglitter            AJ AV
pronto              AV            akimbo              AJ AV
proper              N AJ AV       alibi               N V AV
rally               N V           alongshore          N AJ AV
ready               N AJ V AV     already             AV
reckless            AJ AV         amidships           AJ AV
reply               N V           anywhere            N AV
restless            N AJ AV       apriori             N AJ AV
reverse             N AJ V AV     bareback            AJ AV
sally               N V           barefoot            AJ AV
scaly               AJ            butterfly           N AJ V
seldom              AJ AV         careless            N AJ AV
sheepish            AJ AV         cowardly            AJ V
slantways           AV            crescendo           N AJ V AV
slantwise           AJ AV         elsewhere           AV
smelly              AJ            evermore            AV
sooner              N AV          extempore           AJ AV
speedy              AJ AV         falsetto            N AJ AV
starboard           N AJ V AV     family              N AJ
steadfast           AJ AV         forehand            [illegible]
steady              [illegible]   foremost            AJ AV
sudden              N AJ AV       forever             N AV
sully               V             forzando            AJ AV
tally               N V           furthermore         AV
Word                POS           Word                POS

henceforth          AV            therefore           N AV
hereabout           AV            thereto             AV
hereafter           N AV          thereupon           AV
hereby              AV            thousandfold        N AJ AV
hitherto            AJ AV         twelvefold          AJ AV
homily              N             unaware             AJ AV
however             AV            underground         N AJ AV
howsoever           AV            underhand           N AJ V AV
hundredfold         N AJ AV       ungodly             AJ
impromptu           N AJ V AV     unholy              N AJ
inasmuch            AV            unruly              AJ
innuendo            N V AV        unsightly           AJ
insomuch            AV            unworldly           AJ
legato              N AJ AV       uppermost           AJ AV
lentamente          AJ AV         upriver             AJ AV
lifelong            AJ AV         verbatim            N AJ AV
manyways            AV            whereabout          N AV
miserly             AJ            whereby             AV
nevermore           AV            wherefore           N AV
ninefold            N AJ AV       whereupon           AV
outermost           AJ AV         wholesale           N AJ V AV
overboard           AV            yesterday           N AJ AV
overhand            N AJ V AV     altogether          N AJ AV
overhead            N AJ AV       beforehand          AJ AV
overland            N AJ AV       contrariwise        AJ AV
overnight           N AJ V AV     everyway            AV
overtime            N AJ V AV     everywhere          N AV
piecemeal           N AJ V AV     fortissimo          N AJ AV
sevenfold           AJ AV         henceforward        AV
sforzando           N AJ AV       heretofore          N AJ AV
sforzato            AJ AV         incognito           N AJ AV
sideway             N AJ AV       malapropos          N AJ AV
sideways            AJ AV         melancholy          N AJ
sixtyfold           N AJ AV       moderate            AJ AV
somehow             AV            monopoly            N
sometime            AJ AV         nevertheless        AV
someway             AV            oftentimes          AV
somewhere           N AV          pianissimo          N AJ AV
staccato            N AJ V AV     pizzicato           N AJ AV
straightaway        N AJ AV       prestissimo         N AJ AV
thenceforth         AV            sometimes           AV
thereabout          AV            thenceforward       AV
thereabouts         AV            unawares            AV
thereafter          AV            underhanded         AJ AV
thereby             AV
Table 4-4

SPECIAL-FUNCTION WORDS

Word          POS                   Word          POS

he            N AJ V PN IJ          if            N CJ
she           N AJ PN OT            self          N AJ V PN
the           AJ AV OT              of            N PR OT
me            N PN                  stag          N AJ V AV OT
a             N AJ V PR OT          dang          N AJ V AV OT
dead          N AJ AV OT            whang         N V AV OT
mid           N AJ AV PR            each          AJ AV PN
bold          N AJ AV OT            which         AJ PN
and           N AV CJ               rich          N AJ AV
beyond        N AV PR               such          AJ AV PN
round         N AJ V AV PR          nigh          N AJ V AV PR
thud          N V AV IJ             though        AV CJ
whence        N AV CJ               through       N AJ AV PR
since         AV PR CJ              plash         N V AV IJ
once          N AJ AV CJ            swash         N AJ V AV IJ
bounce        N V AV IJ             swish         N AJ V AV IJ
jee           N AV IJ               with          N AV PR
strange       AJ AV IJ              both          N AJ AV CJ PN
like          N AJ V AV PR CJ OT    south         N AJ V AV PR
while         N V CJ                crock         N AJ V AV IJ
vile          AJ AV OT              stock         N AJ V AV OT
same          N AJ AV PN            rank          N AJ V AV OT
some          AJ AV PN OT           plunk         N V AV IJ
thine         PN OT                 whisk         N V AV IJ
mine          N AJ V PN             all           N AJ AV PN OT
one           N AJ V PN OT          fell          N V AV PV
none          N AV PN OT            well          N AJ V AV OT
prone         N AJ AV OT            till          N V PR CJ
woe           N AV IJ OT            still         N AJ V AV CJ
ere           N AV PR CJ            him           N PN
there         N AJ AV PN            whom          PN
where         N AV CJ PN            from          PR
maugre        AV PR                 cum           AJ AV PR
fore          N AJ AV PR IJ         than          PR CJ
more          N AJ AV PN            been          PV
wise          N AJ V AV OT          then          N AJ AV CJ
whose         AJ PN OT              when          N AV CJ PN
ante          N V AV PR             in            N AJ V AV PR
save          N V PR CJ             lain          PV
bove          AV PR                 on            N AJ AV PR
aye           N AV IJ               con           N AJ V AV PR
off           N AJ V AV PR          down          N AJ V AV PR
Table 4-4 (Cont.)

Word          POS                   Word          POS

who           N PN IJ               how           N AV CJ IJ
no            N AJ AV               now           N AJ AV CJ
pro           N AJ AV PR            ay            N AV IJ
so            N AJ AV CJ PN         by            N AJ V AV PR
to            N AV PR               why           N AV IJ
sweep         N V AV                whizz         N V AV IJ
plop          N V AV IJ             Utia          N AV
pop           N AJ V AV IJ          supra         AV PR
up            N AJ V AV PR          contra        N AV PR
bar           N AJ V PR             instead       AV OT
dear          N AJ V AV IJ          abroad        AJ AV PR
near          N AJ V AV PR          amid          PR
her           N AJ PN               inland        N AJ AV OT
per           AJ PR                 behind        N AJ AV PR
or            N AJ PR CJ            around        AJ AV PR
for           N PR CJ               aboard        AV PR
nor           CJ                    toward        AJ AV PR
whirr         N V AV IJ             astride       AV PR
at            N AV PR CJ PN         aside         N AV PR
neat          N AJ AV OT            beside        AV PR
great         N AJ AV OT            inside        N AJ AV PR
that          N AJ AV CJ PN         outside       N AJ AV PR
what          N AJ AV CJ PN IJ OT   unlike        N AJ AV PR CJ
wet           N AJ V AV OT          before        N AJ AV PR CJ OT
yet           AJ AV CJ              because       AV CJ
left          N AJ AV OT            despite       N PR
light         N AJ V AV OT          above         N AJ AV PR
aught         N AV PN               himself       N PN
caught        AJ V PA OT            herself       PN
ought         N V AV PN OT          ourself       PN
it            N PN                  yourself      PN
not           N AV PR               itself        PN
griol         N AJ V AV OT          myself        PN
fast          N AJ V AV IJ          along         AV PR
past          N AJ AV PR OT         endlong       AV PR
initial       N AV PR               among         PR
best          N AJ V AV OT          anigh         AV PR
lest          CJ                    although      CJ
man           N AJ AV P,v OT        enough        N AJ AV IJ
but           N AJ AV PR CJ PN      awash         AJ AV OT
out           N AJ V AV PR IJ       beneath       AJ AV PR
bout          N AJ V AV PR          argal         N AV CJ
next          N AJ AV PR            until         PR CJ
you           N PN OT               hitkkm        PV
Table 4-4 (Cont.)

Word          POS                   Word          POS

between       N AV PR               without       N AV PR
amen          N V AV IJ             betwixt       AV PR
certain       N AJ AV OT            adieu         N AV IJ
within        N AJ AV PR            below         N AJ AV PR
upon          PR                    midway        N AJ AV PR
ago           AJ AV                 bully         N AJ V AV IJ
into          PR                    only          AJ AV PU CJ
presto        N AJ V AV IJ          any           AJ AV PN OT
asleep        AJ AV OT              alongside     AV PR
atop          AJ AV PR              opposite      N AJ AV PR
anear         AV PR                 oneself       PN
yonder        AJ AV [illegible]     sidelong      AJ AV PU
under         N AJ AV PR            underneath    N AJ AV PR
rather        AV OT                 wherewith     N AV PN
whether       N AV CJ               unequal       N AJ AV OT
either        AJ AV CJ PN           overall       N AJ AV OT
neither       AJ AV CJ PN           unbeknown     N AJ AV OT
whither       AV CJ                 another       AJ PN
other         N AJ AV PN            whichever     AJ PN
after         N AJ AV PR CJ         whomever      PN
better        N AJ V AV OT          whenever      AV CJ
whoever       PN                    whensoever    AV CJ
over          N AJ V AV PR          whosoever     AJ PN
atour         AV PR                 wherever      AV CJ
abaft         AV PR                 whatever      AJ AV PN
outwrought    PA OT                 somewhat      N AV PN
albeit        CJ                    unliot bought AV in
howbeit       AV CJ                 amidmost      AV PR
aslant        AJ AV PR              undermost     AJ AV OT
except        V PR CJ               anyhow        AV OT
athwart       AV PR                 anyway        AV OT
amort         AV OT                 bimonthly     N AJ AV OT
amidst        N AV PR               mslaiuiv      AV CJ OT
amongst       PR                    oxerteratc    AJ V AV OT
against       PR                    wherewithal   N AV PN
midmost       N AJ AV PR            anybody       N PN
aoust         N AJ AV OT            everybody     PN
about         AJ AV PR              immediately   AV CJ
throughout    AJ AV PR



been lost in the sorting process. No other significant omissions have been noted, but
are of course possible since checking of the tape dictionaries was not exhaustive. The
parts of speech given in Tables 4-1 through 4-4 were taken from the tape dictionary
and have been verified in the dictionaries themselves.

Task  3:  Modification  of  Rule  A  Using  a  Study  of  Affixes 

Rule A is based upon a general observation. The business of Task 3 is to discover
if it is possible, by considering prefixes and suffixes (which might well be expected to be
key structural elements indicative of syntactic roles), to convert a general rule
evidently effective in a majority of cases to an exhaustive rule effective for 95 percent
of the words. It was first necessary to develop a formal and reproducible definition of
prefixes and suffixes, as is described in "The Nature of Affixing in Written English"
and "Structural Definition of Affixes in Multisyllable Words." It was then necessary to
investigate the extent of the correlation between affixes and part of speech, as
described in "Part-of-Speech Implications of Affixes." The results of the correlation
can be briefly described here.

All words with part of speech AV, PR, l\V, NP, IJ, PA, Pl\, VI*, and CJ can be
automatically assigned part of speech by reference to the word lists in Tables 4-1
through 4-4, followed by application of Rules B and C for words not in these lists.

"Part-of-Speech Implications of Affixes" was therefore concerned only with words whose
part-of-speech string contained the elements NA, AJ, and VB, which allows the five
possible combinations VB, NA, AJ, NA-VB, AJ-VB, NA-AJ being considered equivalent
to NA. Attempts to establish a 95 percent correlation between the part-of-speech string
of a word and its affixes failed. However, it was noted that the correlation was closer
for four- to seven-syllable words than for two- to three-syllable words, and that a very
good correlation could be obtained for all words between an "inclusive" part-of-speech
string and the affixes. Thus in some cases affix-vowel-string considerations enable an
absolute identification of the part of speech of a word, but in other cases identification
is to a more inclusive set. For example, a NA or a VB may be classified as NA-VB,
or an AJ may be classified as a NA. Such a classification is justifiable on the
following grounds:

•  It is the natural task of a syntactic analysis program to choose among several
   possible parts of speech, and it is easier to do so than to supply a missing
   part of speech.

•  Dictionaries are very reliable in the information explicitly given, but
   implications inferred from the absence of information are less reliable. Thus the
   inclusive part-of-speech string assigned by the algorithm may in some cases
   be more correct than the more limited one assigned by a particular dictionary.
   In our experience with the SOX and MW3 dictionaries we found many instances
   of nonagreement, usually one more inclusive than the other.

In "Part-of-Speech Implications of Affixes," the results of the correlation study are
given for 72 prefixes and S7 suffixes; implications are of the form NA, or NA-VB, or
VB, or AJ. For 41 of the affixes, the part-of-speech implication changes with the length
of the word, from NA-VB for two- and three-syllable words to NA for four- to
eight-syllable words.
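
A length-dependent implication of this kind can be sketched as follows. The Python fragment is illustrative only: it assumes that word length is measured by counting vowel strings (runs of the letters a, e, i, o, u, y), which may differ from the operational definition used in the study, and it assumes the word carries one of the 41 length-sensitive affixes. The function names are hypothetical.

```python
import re

# A vowel string is taken here to be any unbroken run of vowel letters.
VOWEL_STRING = re.compile(r"[aeiouy]+")


def vowel_string_count(word):
    """Number of vowel strings in the word, used as a stand-in for syllables."""
    return len(VOWEL_STRING.findall(word.lower()))


def length_dependent_implication(word):
    """NA-VB for two and three vowel strings, NA for four to eight, else undecided."""
    n = vowel_string_count(word)
    if 2 <= n <= 3:
        return ("NA", "VB")
    if 4 <= n <= 8:
        return ("NA",)
    return None
```

Under these assumptions a word such as indexing (three vowel strings) receives NA-VB, while dictionary (four vowel strings) receives NA.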

Later a correlation was made for the affixes, previously mentioned, which seemed
to be likely candidates for reducing the exception lists by aiding in the identification of
adverbs or in the identification of words ending in ed which are not past participles.
Though not operationally defined, these affixes are of practical importance and are
therefore listed below, with their part-of-speech implications.
Prefixes    POS         Suffixes    POS

north       NA AV       seed        NA
south       NA AV       weed        NA
west        NA AV       like        NA AV
a-          AJ AV       wise        AJ AV
                        ward        NA AV
                        wards       NA AV
                        -fly        NA

TESTING  AND  EVALUATION 

Rules A, B, and C, the exception lists, and the prefix and suffix implications
reported in Reference 7 were incorporated into a computer program for testing the
algorithm. In the program the exception lists were checked first, then the word was
separated into kernel and affix parts, then Rules B and C and the other affix rules were
executed, and finally Rule A was applied to all words still without a part-of-speech
assignment. There are some complications involved in some of these steps, particularly
in separating a word into kernel and affix parts, and in assigning parts of speech
on the basis of affixes. The logic used by the program for these steps is given in
Fig. 4-1.
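
The order of operations described above can be sketched in Python. This is a minimal illustration under stated assumptions and not the program that was tested: assign_pos and the three toy_ helpers are hypothetical names, the affix splitting is reduced to four suffixes, Rule C is assumed to assign AV to words ending in ly, and Rule A is approximated by treating an unaffixed word as noun-verb and any other word as noun only.

```python
def assign_pos(word, exceptions, split_affixes, affix_rules, rule_a):
    """Apply the stages in the stated order and return the first decision."""
    if word in exceptions:                      # 1. exception lists
        return exceptions[word]
    prefix, kernel, suffix = split_affixes(word)  # 2. kernel and affix parts
    for rule in affix_rules:                    # 3. Rules B and C, other affix rules
        pos = rule(prefix, kernel, suffix)
        if pos is not None:
            return pos
    return rule_a(prefix, kernel, suffix)       # 4. Rule A for whatever remains


def toy_split(word):
    """Toy stand-in: strip one of four suffixes, never a prefix."""
    for suffix in ("ing", "ly", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 1:
            return "", word[: -len(suffix)], suffix
    return "", word, ""


def toy_rule_c(prefix, kernel, suffix):
    """Toy stand-in for Rule C: the ly suffix implies an adverb."""
    return ("AV",) if suffix == "ly" else None


def toy_rule_a(prefix, kernel, suffix):
    """Toy stand-in for Rule A: unaffixed words noun-verb, others noun only."""
    return ("NA", "VB") if not (prefix or suffix) else ("NA",)
```

Checking the exception lists first means that a listed word such as costly never reaches the suffix rules, which is what allows the rules themselves to stay simple.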

To summarize briefly, the criteria by which an affix sequence was accepted as an
affix in a given word were the same as those given in Reference 7. Prefixes were given
priority in the stripping of affixes from the kernel, but suffixes were given priority in
assigning the parts of speech of the word (as is also explained in Reference 7).

To test the algorithm, 500 words were chosen at random from the tape dictionary
(Reference 2) and the parts of speech assigned by the algorithm were compared with
those given
Fig. 4-1  Search-for-Affixes Flow Diagram

Box labels legible in the diagram (connecting flow lines and yes/no branches
are not reproduced):

    BEGIN
    Determine the number of vowel strings in XTAB and store the value in XLEN
    Is this value = 0?
    Check if previously determined there are no suffixes
    Was there a prefix found?
    Check if suffix and/or kernel fulfills conditions; make any changes desired in either
    Set signal T suffix
    Test if a transformational suffix matches with the termination of KERNEL
    Test if a normal suffix matches
    Finished finding affixes
    Set signal R suffix
    Tested all R suffixes / Tested all T suffixes
    Put suffix into TENTS and rest of KERNEL into ZTAB
    Test terminal consonant string of ZTAB is legal or blank
    Negate suffix. Was this a T prefix?
    Determine the number of vowel strings in ZTAB and store the number in ZLEN
    Is this value = 0?
in the dictionary. If dialectal, obsolete, archaic, and rare words causing errors are
removed, and if program errors are corrected, results are as follows:

Category                                 Number of words in category

Assigned POS matches dictionary POS      271
Extra POS assigned                       196
Missing POS                               18
POS does not match at all (error)          6
Total sample                             491

This shows that 95.1 percent of the words were assigned the correct inclusive
part of speech and 55.2 percent were assigned parts of speech exactly coinciding with
those assigned by the dictionary. Thus, the goal of 95 percent has just been achieved.
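
The two percentages follow directly from the category counts. The short Python check below is illustrative only; the variable names are not from the report, and the counts are taken as 271, 196, 18, and 6, the values that add up to the stated total of 491.

```python
counts = {
    "exact match": 271,
    "extra POS": 196,
    "missing POS": 18,
    "error": 6,
}

total = sum(counts.values())

# An assignment is inclusive-correct when nothing the dictionary gives is missing,
# that is, when it is an exact match or carries only extra parts of speech.
inclusive = counts["exact match"] + counts["extra POS"]

exact_pct = round(100 * counts["exact match"] / total, 1)
inclusive_pct = round(100 * inclusive / total, 1)
```

With these counts the total is 491, the exact-match share is 55.2 percent, and the inclusive share is 95.1 percent.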

It is interesting to consider how little the affix implications have improved the
results for this sample. Taking the first 192 of the 500 alphabetized words and applying
the original Rules A, B, and C only, 20 words are shifted into the exact-match category
and 25 words out of the exact-match category, for a net loss of 5 words, of which 2
go into the error category. Six words are added to the words with missing part
of speech while two words are taken out of the category. Thus the total loss is 4 more
words into the missing category and 2 more words into the error category, or about
a 3 percent loss from the point of view of inclusive part of speech. Rule A, it will be
remembered, requires the removal of affixes from the kernel of the word. If this
kernelizing of the word is omitted, there is about a 13 percent loss from the point of
view of inclusive part of speech, indicating that the fact that a word is affixed is more
important in predicting part of speech than what the affix is (the affixes ing, ed, and ly
excepted). Nevertheless, using the implications of affixes is a refinement in an area
where refinement is sorely needed.
It might be interesting at this point to evaluate the two original premises, that
elementary words are largely noun-verb and all other words are largely noun only
(Reference 1). To test the first premise, the standard one-vowel-string words in the
tape dictionary were divided into two sections, those which were NA-VB (and only
NA-VB) and those which were not. (The OT category was ignored.) There were 2,520
words in the NA-VB category and 1,925 words with more or less parts of speech than
NA-VB. The 1,925 word list includes the 132 one-syllable members of the word-class
with parts of speech PR, CJ, IJ, PN, and PV listed in Table 4-4. Discounting these 132
function words then, the first premise is true for 2,520 out of 4,313 cases, or about
58 percent. To get 95 percent of the one-syllable words assigned as in the dictionary,
most of the 1,793 non NA-VB words would have to be in an exception dictionary.
However, since most of these are NA, from the point of view of inclusive part of speech,
the NA-VB rule for elementary words is quite good, giving results very close to those
obtained in the 500-word random sample of all words (55 percent exactly matching
dictionary, 95 percent giving correct inclusive part of speech).
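
The figures quoted for the first premise can be reproduced from the three counts given in the text. The following Python lines are an illustrative check only; the variable names are not from the report.

```python
noun_verb_only = 2520      # one-vowel-string words that are NA-VB and only NA-VB
other_words = 1925         # one-vowel-string words with more or fewer parts of speech
function_words = 132       # one-syllable function words listed in Table 4-4

# Discount the function words before forming the ratio.
cases = noun_verb_only + other_words - function_words
non_noun_verb = other_words - function_words
share_pct = round(100 * noun_verb_only / cases)
```

This gives 4,313 cases, 1,793 words that are not NA-VB, and a share of 58 percent for the first premise.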

The second premise has not been directly tested, but may be inferred from the
500-word random sample, since we have just proven that the one-syllable words (there
are 40 in the sample) do not affect the results substantially. In its general form the
second premise is true about 70 percent of the time, as is reported in Reference 1.
In its modified form as stated in Rule A, and tested by our 500-word sample, it is true
for only about 55 to 60 percent of the cases, but is good for about 90 to 95 percent of
the cases from the point of view of inclusive part of speech, with something less than
5 percent variation, depending on whether or not part-of-speech implications of affixes
are used.




SUMMARY


The net result of the part-of-speech studies is an algorithm which, used in conjunction
with a dictionary of fewer than 800 words and an affix list of fewer than 200,
gives a correct "inclusive" part of speech for 95 percent of a 500-word random sample,
and which should do better on textual material. The dictionary is derived from an
exhaustive compilation of words which the algorithm is not capable of handling. Such
words are either adverbs, function words, participles, or collective nouns not recognized
by the program, or conversely, words so classified which should not be. The
number of words in the exhaustive list is 3,163, of which only 754 were selected for
the dictionary. However, as explained in the body of the text, all of the 267 function
words with parts of speech other than NA, AJ, VD, or AV have been included, as have
all of the irregular past verbs and past participles and the more commonly used adverbs
and collective nouns. The 2,409 words omitted are mainly less common adverbs and
collective nouns, and they comprise only about 3 percent of the total 73,582-word
dictionary.
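
The counts quoted in the summary can be checked with a few lines of arithmetic. The sketch below only restates the figures given above; the variable names are illustrative and are not taken from the report.

```python
# Figures quoted in the summary (variable names are illustrative only).
exhaustive_list = 3163      # words the algorithm cannot handle
selected = 754              # words actually placed in the dictionary
full_dictionary = 73582     # size of the complete word dictionary

omitted = exhaustive_list - selected
share = omitted / full_dictionary

print(omitted)              # words left out of the exception dictionary
print(f"{share:.1%}")       # their share of the full dictionary
```

The difference agrees with the 2,409 omitted words stated in the text, and the share is consistent with "about 3 percent."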

3. Webster's Third New International Dictionary of the English Language, Springfield,
Mass., G. & C. Merriam Company, Publishers, 1961

4. H. W. Fowler, A Dictionary of Modern English Usage, 2nd ed., revised and
edited by Sir Ernest Gowers, New York and Oxford, Oxford University Press, 1965

5. H. Resnikoff and J. Dolby, "The Nature of Affixing in Written English,"
Mechanical Translation, 8, June and October 1965

6. L. L. Earl, "Structural Definition of Affixes in Multisyllable Words" (in manuscript)

AUTOMATIC INDEXING USING COMBINED SYNTACTIC
AND ENTROPY SELECTION CRITERIA


5. PROGRESS REPORT ON A SYNTACTIC-STATISTICAL METHOD
FOR AUTOMATIC INDEXING

L. L. Earl

A method for automatic indexing has been developed with two basic aims in mind.
The first aim has been to provide a two-level index:

Level (1): Index terms which are descriptive of the subject matter yet represent
a drastic reduction of volume

Level (2): Index phrases, arranged under the terms, which consist of selected
phrases from the text containing the terms and which give a more
complete picture of the subject matter

These two levels would allow selective storage or retrieval depending on the capacities
of the system or the needs of the individual information seeker. With the present
algorithm the reduction on the first level is to somewhere between 0.[illegible] and
0.[illegible] percent of the original volume, and on the second level to somewhere
between 0.[illegible] and 3 percent. The wide range of reduction on both levels has to
do with the second aim, which is to adjust the density of terms and phrases to
correspond with the information content of the text.

In the method developed, the text is first reduced by selecting a group of phrases
on the basis of their syntactic function; the noun phrases selected are the subjects of
verbs, and the objects or predicate complements of verbs or infinitives, together with
any modifying genitive phrases. Frequency counts are taken of the remaining words,
then index terms are taken from the words with highest frequency counts, using entropy
criteria to determine the number of index terms chosen. Deriving suitable criteria to
meet the second basic aim, adjustment of the density of index phrases according to the
level of information content, has been a major concern. Five texts have been used as
samples for experimental purposes:

Text I - An excerpt from an electronics textbook, concerned with amplifiers

Text II - An excerpt from a text on information processing, concerned with the
encoding of speech and pictures

Text III - A talk given by a college president on the inauguration of a department
of synnoetics

Text IV - A chapter from a book about the personal history of the scientists who
developed the atomic bomb (titled Brighter Than a Thousand Suns)

Text V - A journal article on a technique in processing of lists by computer
(titled "Multiword List Items")

The indexes produced using the current algorithm are shown in Figs. 5-1 through 5-5;
each of the five texts has been reproduced in Figs. 5-6 through 5-10. The first six
letters of each index term appear on the left. The index phrases containing the index
terms are listed underneath the terms, indented 7 spaces. In the opinion of the author,
results are best for the textbook excerpts (Texts I and II), satisfactory for the light,
low-content texts (Texts III and IV), and poorest on the more specialized and abstract
journal article (Text V).
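
The statistical half of the method (frequency counts, six-letter index terms, and phrases grouped under each term) can be illustrated with a minimal Python sketch. It assumes the syntactic selection of noun phrases has already been done and that the number of index terms is supplied by the caller; the report's actual entropy criterion is not specified in this section, so `entropy_bits` only shows the quantity on which such a criterion might be based. The names `FUNCTION_WORDS`, `stem`, `entropy_bits`, and `build_index` are hypothetical.

```python
import math
from collections import Counter, defaultdict

# Assumed stop list; the report does not give one in this section.
FUNCTION_WORDS = {"a", "an", "and", "of", "or", "the"}

def stem(word, width=6):
    """Truncate a word to its first `width` characters, as in Figs. 5-1 to 5-5."""
    return word.upper()[:width]

def entropy_bits(counts):
    """Shannon entropy, in bits, of a list of frequency counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

def build_index(phrases, n_terms):
    """Group already-selected phrases under their n_terms most frequent stems."""
    freq = Counter(
        stem(w) for p in phrases for w in p.split()
        if w.lower() not in FUNCTION_WORDS
    )
    terms = [t for t, _ in freq.most_common(n_terms)]
    index = defaultdict(list)
    for p in phrases:
        stems = {stem(w) for w in p.split()}
        for t in terms:
            if t in stems:
                index[t].append(p)
    return dict(index)

phrases = [
    "PUSH'PULL AMPLIFIERS",
    "VIDEO AMPLIFIER",
    "GROUNDED CATHODE CIRCUIT",
    "CIRCUIT DIAGRAM OF AN AMPLIFIER",
]
index = build_index(phrases, n_terms=2)
for term, listed in index.items():
    print(term)
    for p in listed:
        print(" " * 7 + p)   # phrases indented 7 spaces under the term
```

Truncation to six letters is what lets AMPLIFIER and AMPLIFIERS fall under the single term AMPLIF, at the cost of keeping ITEM and ITEMS apart, as Fig. 5-5 shows.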

AMPLIF
       USE OF MULTIPLE AMPLIFIER STAGES
       POWER AMPLIFIERS
       GROUNDED GRID AMPLIFIER
       ENVELOPE RESPONSE CHARACTERISTICS OF BAND PASS AMPLIFIERS
       OUTPUT OF ONE AMPLIFIER
       PRACTICAL PUSH'PULL CLASS B AMPLIFIERS
       TWO AMPLIFIERS
       ENVELOPE RESPONSE OF BAND PASS AMPLIFIER
       VIDEO AMPLIFIER
       PUSH'PULL AMPLIFIERS
       THUS GROUNDED GRID AMPLIFIERS
       MOST MULTISTAGE AMPLIFIERS
       TRANSIENT RESPONSE CHARACTERISTICS OF BAND PASS AMPLIFIERS
       TUNED AMPLIFIERS
       TRANSIENT RESPONSE OF EQUIVALENT VIDEO AMPLIFIER
       PC THAN CLASS AB AMPLIFIERS
       RESPONSE OF BAND PASS AMPLIFIER
       DISTRIBUTED AMPLIFIER
       ENVELOPE RESPONSE OF AMPLIFIER
       CIRCUIT DIAGRAM OF AN AMPLIFIER
       PERFORMANCE OF BAND PASS AMPLIFIER
       MORE STAGES OF VIDEO AMPLIFICATION

CIRCUI
       CIRCUIT DIAGRAM OF AN AMPLIFIER
       DESIGN OF OUTPUT COUPLING CIRCUITS
       GROUNDED CATHODE CIRCUIT
       EQUIVALENT PLATE CIRCUIT OF FIGURE
       ANALYSIS OF CIRCUITS
       CIRCUIT CONNECTION
       SIMPLE CIRCUITS
       COMMON CAUSE OF CIRCUIT CONDITION

RESPON
       RESPONSE OF BAND PASS AMPLIFIER
       ENVELOPE RESPONSE OF AMPLIFIER
       ENVELOPE RESPONSE OF BAND PASS AMPLIFIER
       SAME TRANSIENT RESPONSE CHARACTERISTICS
       ENVELOPE RESPONSE CHARACTERISTICS OF BAND PASS AMPLIFIERS
       TRANSIENT RESPONSE CHARACTERISTICS OF BAND PASS AMPLIFIERS
       POSSIBLE RESPONSE
       DETERMINATION OF ENVELOPE RESPONSE
       TRANSIENT RESPONSE OF EQUIVALENT VIDEO AMPLIFIER


Fig. 5-1 Index for Text I

PICTUR
       COMPLETE PICTURE
       NEW PICTURE
       PICTURE SIGNAL
       VOLTAGE OF PICTURE SIGNAL
       NEXT PICTURE
       SORTS OF PICTURES
       ACTUAL PICTURE'TRANSMISS SYSTEMS
       BRIGHT OF COLOR TV PICTURE

SIGNAL
       STRENGTH OF SPEECH SIGNAL
       PICTURE SIGNAL
       VOLTAGE OF PICTURE SIGNAL
       ORIGINAL SPEECH SIGNAL
       SLOW VARIATIONS OF SIGNAL
       CONSIDERABLE STRETCH OF SIGNAL
       OF SIGNAL
       DESCRIPTION OF SIGNAL
       LIMITATIONS OF SOURCE SIGNAL
       ANALOG SIGNAL
       SPEECH SIGNAL SOUNDS

SPEECH
       ELECTRICAL REPLICA OF SPEECH
       SPEECH QUALITY
       SPEECH TRANSMISSION
       STRENGTH OF SPEECH SIGNAL
       MOST COMPLICATED SPEECH SOUNDS
       SPEECH SIGNAL SOUNDS
       INTELLIGIBLE SPEECH
       FINE TEMPORAL STRUCTURE OF SPEECH WAVE
       ORIGINAL SPEECH SIGNAL
       SPEECH TRANSMISSION PROBLEM
       SPEECH SOUNDS
       ENTROPY OF SPEECH

VOCODE
       UNNATURAL SOUND OF CHANNEL VOCODE
       EVEN IMPERFECT VOCODE
       TRANSMITTING ANALYZER AND RECEIVING SYNTHESIZER UNITS OF VOCODE
       CHANNEL VOCODE OF FIGURE VII-4 NEEDS
       SORT OF VOCODE DESCRIBED SENDS INFORMATION
       MOST VOCODE
       FORMANT TRACKING VOCODE
       VOCODE QUALITY
       COST OF VOCODE EQUIPMENT
       VOCODE WAY OF
       VOICE-EXCITED VOCODE


Fig. 5-2 Index for Text II

SYNNOE
       DEVELOPMENT OF ORDERLY THEORY OF SYNNOETICS
       MAN'MAN SYNNOESIS
       BRANCH OF SYNNOETICS
       ETYMOLOGY OF SYNNOETICS
       SUBJECTS OF SYNNOETICS
       IMPLEMENTATION OF SYNNOETIC SYSTEMS
       PURE AND APPLIED SYNNOETICS

THEORY
       THEORY OF PRACTICE OF
       STUDY OF THEORY
       DEVELOPMENT OF ORDERLY THEORY OF SYNNOETICS


Fig. 5-3 Index for Text III

ATOMIC
       CERTAIN ATOMIC
       MINDS OF ATOMIC PHYSICISTS
       ATOMIC BOMB
       ATOMIC ARMAMENT
       FUTURE OF ATOMIC
       ATOMIC SCIENTISTS
       ONE OF ATOMIC [illegible]
       ATOMIC AIR RAID
       ATOMIC PHYSICIST
       ATOMIC WEAPON
       ATOMIC LABORATORIES
       RAIN OF ATOMIC BOMBS
       ATOMIC TEST EXPLOSION
       OF ATOMIC PHYSICISTS
       NO ATOMIC SECRET
       DIRECTION OF NEW ATOMIC INDUSTRY
       FEDERATION OF ATOMIC SCIENTISTS

NEW
       NEW LEGISLATIVE PROPOSAL
       NEW LEGISLATION
       DIRECTION OF NEW ATOMIC INDUSTRY
       MEMBERS OF NEW SCIENTIFIC
       DANGEROUS EFFECTS OF NEW POWER
       OF NEW KIND OF WEAPONS DEVELOPMENT
       NEW FRIENDS
       NEW POWER
       SINGLE ONE OF NEW BOXES
       SCIENTISTS' VISION OF NEW WORLD

PHYSIC
       MINDS OF ATOMIC PHYSICISTS
       OF ATOMIC PHYSICISTS
       YOUNG AMERICAN PHYSICIST
       LEARNED PHYSICIST
       ATOMIC PHYSICIST
       ONE OF AMERICAN PHYSICISTS
       THREE PHYSICISTS
       YOUNG PHYSICIST

SCIENT
       GROWING TENDENCY OF SCIENTISTS
       ATOMIC SCIENTISTS
       YOUNG SCIENTISTS
       SCIENTISTS' BLOOD
       OF SCIENTISTS
       DOZEN OF YOUNGER SCIENTISTS
       INDIRECT METHOD OF SCIENTISTS
       SCIENTISTS' VISION OF NEW WORLD
       MEMBERS OF NEW SCIENTIFIC
       FEDERATION OF ATOMIC SCIENTISTS


Fig. 5-4 Index for Text IV

       ITEM FORMAT
       MULTIWORD ITEM
       SINGLE LIST ITEM
       SUCCESSOR OF ITEM
       MULTIWORD ITEM CONCEPT
       LONGER ITEM
       ONLY ONE ITEM
       CONCEPT OF MULTIWORD LIST ITEM
       SPECIFIC ITEM
       ADDRESS OF SUCCEEDING ITEM
       LOCATION OF PARTICULAR ITEM
       ONLY ONE TWOWORD ITEM
       SIMPLE FORM OF MULTIWORD ITEM
       POSSIBLE ITEM OF TYPE
       OF ITEM
       TWORD

ITEMS
       DELETION OF ITEMS
       LIST ITEMS
       MEMORY LOCATIONS OF FIRST WORD OF TWO ITEMS
       SUCCESSIVE LIST ITEMS
       MOST SIGNIFICANT CONTRIBUTION OF MULTIWORD ITEMS
       THEN VARIABLELENGTH ITEMS
       MULTIWORD ITEMS
       INEFFICIENCIES OF SINGLEWORD ITEMS
       CONNECTED SEQUENCE OF ITEMS
       FOUR ITEMS
       CONSIDERABLE MANIPULATION OF SEQUENCE OF LIST ITEMS
       FILE OF ITEMS
       EMPTY SPACE LIST ITEMS
       NUMBER OF SINGLET LIST ITEMS
       MULTIWORD LIST ITEMS
       TWOWAY LIST OF THREEWORD ITEMS
       NO INDIVIDUAL ITEMS
       USE OF MULTIWORD ITEMS
       LARGE FILE OF ITEMS
       MOST FREQUENT LIST OPERATIONS INSERTING AND DELETING ITEMS
       THREE TYPES OF ITEMS


Fig. 5-5 Index for Text V

       SUCCESSIVE LIST ITEMS
       TWOWAY LIST
       SIMPLE LIST
       SPACE
       DOUBLET LIST
       CONCEPT OF MULTIWORD LIST ITEM
       LIST HEAD
       LOCATION OF HEAD OF APPROPRIATE LIST
       SINGLE LIST ITEM
       DOUBLET SPACE LIST
       LIST REPRESENTING ROW
       SINGLET LIST
       LIST STRUCTURE
       NUMBER OF SINGLET LIST ITEMS
       SPACE LIST PROBLEM
       FUNCTION OF WHOLE
       SIMPLICITY OF LIST STRUCTURES
       SIZE OF LIST
       ONE LIST STRUCTURE


Fig. 5-5 Index for Text V (Cont.)

THE PLATE AND GRID CIRCUITS

THAT IS, THE COMPLETE FREQUENCY RESPONSE CHARACTERISTIC OF A
RESISTANCE COUPLED AMPLIFIER WOULD BE NEARLY IDENTICAL TO THAT OF
SINGLE TUNED AMPLIFIER IF THE NEGATIVE FREQUENCY RANGE COULD BE
OBTAINED IN A PRACTICAL CASE.

SUCH A FORMANT TRACKING VOCODER CAN BE USED TO TRANSMIT SPEECH WITH EVEN
FEWER BINARY DIGITS PER SECOND THAN THE CHANNEL VOCODER OF FIGURE VII-4

THE TV PROBLEM IS MUCH MORE DIFFICULT THAN THE SPEECH TRANSMISSION
PROBLEM.


PARTLY, THIS IS BECAUSE THE SENSE OF SIGHT IS INHERENTLY MORE DETAILED

Fig. 5-7 Text II (Cont.)

Fig. 5-8 Text III (Cont.)

OR AGAIN, IT WAS POSSIBLE, AND THIS POSITION WOULD BE THE STRANGEST
OF ALL, REALLY ONLY COMPARABLE WITH THE CONTRADICTORY DATA OF ATOMIC
PHYSICS, FOR ONE AND THE SAME PERSON TO FEEL PRIDE AND SHAME
SIMULTANEOUSLY.

Fig. 5-9 Text IV

Fig. 5-9 Text IV (Cont.)


WE WERE NATURALLY SHOCKED BY THE EFFECT OUR WEAPON HAD PRODUCED, AND
IN PARTICULAR BECAUSE THE BOMB HAD NOT BEEN AIMED, AS WE HAD ASSUMED,
SPECIFICALLY AT THE MILITARY ESTABLISHMENTS IN HIROSHIMA, BUT
DROPPED IN THE CENTER OF THE TOWN.

WHEN THEY READ IN THE NEWSPAPERS, AT THIS TIME, THAT MEMBERS OF
CONGRESS WERE IN FAVOR OF THE UNITED STATES KEEPING THE SECRET OF THE
ATOM BOMB TO THEMSELVES, THE PHYSICISTS WOULD HAVE LIKED TO RETORT
THAT THERE WAS NO ATOMIC SECRET WHICH COULD NOT BE DETECTED WITHIN A
VERY SHORT TIME BY ANY NATION SCIENTIFICALLY OF THE FIRST RANK.


THEY WOULD HAVE LIKED TO PRESS FOR THE IMMEDIATE CONVOCATION, ON
AMERICAN INITIATIVE, OF AN INTERNATIONAL CONFERENCE ON THE CONTROL OF
ATOMIC DEVELOPMENT, AS HAD BEEN DESIRED BY BOHR, SZILARD AND THE
AUTHOR OF THE FRANCK REPORT.

HIS NEGATIVE REACTION TO THE BILL'S CONTENTS WAS SUPPORTED BY THE
LEGAL FACULTY OF HIS UNIVERSITY IN CHICAGO WHEN HE SUBMITTED THE
DOCUMENT TO THAT BODY.

Fig. 5-9 Text IV (Cont.)

MANY OF THE SCIENTISTS VERY SOON PERCEIVED THAT THIS ASSET OF
ACCUMULATED ATTENTION AND RESPECT MIGHT PERHAPS BE CONVERTED INTO
THE CURRENT COIN OF A GENUINE POLITICAL INFLUENCE.

WHAT THE YOUNG SCIENTISTS LACKED IN POLITICAL EXPERIENCE THEY MADE UP
FOR BY AN ENTHUSIASM AND SINCERITY WHICH DEEPLY IMPRESSED THE
POLITICIANS AND IN PARTICULAR THE REPRESENTATIVES OF THE PRESS IN
WASHINGTON,


IT WAS KNOWN THAT THIS STRANGEST OF ALL LOBBIES WAS FINANCED ONLY BY

SUCH A CONFIGURATION IS CALLED A LIST STRUCTURE

SUPPOSE IT IS DESIRED TO TAKE AN ITEM FROM THE SPACE LIST
AND INSERT IT BETWEEN THE FIRST AND
SECOND ITEMS OF THE OTHER LIST.

BUT IN SUCH CASES, IT REQUIRES THE PROVISION OF
SPECIAL OVERFLOW PROCEDURES IF A TABLE BECOMES FULL

Fig. 5-10 Text V (Cont.)


IF ALL LETTERS HAVE AN EQUAL PROBABILITY OF OCCURRING AS THE FIRST
CHARACTER OF A SYMBOL, THE SEARCH TIME IS REDUCED BY A FACTOR OF 2.

REMEMBER THE STARTING POINT OF THE AS YET UNUSED PORTION OF
MEMORY, AND TO TAKE A NEW ITEM AS IT IS REQUIRED.


PROBABLY THE MOST SIGNIFICANT CONTRIBUTION OF MULTIWORD ITEMS
TO THE PROCESSING OF LIST STRUCTURES LIES IN THE AVAILABILITY OF
MULTIPLE POINTERS.

THE OPERATION WORD ALSO
CONTAINED SOME CODED INFORMATION ON THE SCALING OF THE TWO OPERANDS

THE TWOWAY POINTERS WERE IMPORTANT FOR TWO REASONS.


SINCE PROGRAM EFFICIENCY WAS OF PRIMARY IMPORTANCE THEY WERE REALTIME
CONTROL PROGRAMS $ CONSIDERABLE SCANNING AND RESCANNING OF THIS LIST
WAS DONE, TRYING TO IMPROVE THE OBJECT PROGRAMS.

THUS, EVERY ITEM CONTAINS A POINTER FOR EACH CHARACTERISTIC, CONNECTING IT TO
THE LIST WHICH REPRESENTS THE PROPER VALUE OF THAT CHARACTERISTIC.

Fig. 5-10 Text V (Cont.)

FOR EXAMPLE, SOME LINEAR PROGRAMMING PROBLEMS MAY INVOLVE A 000 BY 000 MATRIX
WITH ONLY [illegible] TO XXX NONZERO ELEMENTS.

Fig. 5-10 Text V (Cont.)



STUDIES IN PHONETIC ENGLISH


6. STATISTICS OF OPERATIONALLY DEFINED HOMONYMS OF ELEMENTARY WORDS*

B. V. Bhimani, L. L. Earl, and R. P. Mitchell


Words that are pronounced the same but have different spellings and meanings, as
for example pail and pale, generally called homonyms, have long been of interest to
punsters. Systematic investigation of the number and nature of these words shows
that they are also of more general and serious interest. Of the approximately 5,700
"elementary" words in the dictionaries studied,1 about 3,000 can be ambiguous in
their spoken form. Moreover, many of these words are common words; of the 503
words in Godfrey Dewey's word list11 with a text frequency of more than 20 in a sample
of 100,000, only 222 are not part of a homonym set. Thus, homonyms are a
significant class of words not to be overlooked in the study of the English language.**
For purposes of this study, a homonym set was defined as a set of different orthographic
forms having an identical phonetic transcription as provided by a specified
authoritative source. Any member of a homonym set is called a homonym. An exhaustive
compilation of all such sets was made, by computer program, from the 5,757
elementary words listed in the five dictionaries considered (References 4 through 8),
each of which provides an authoritative phonetic transcription.
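
The operational definition above amounts to grouping spellings by their phonetic transcription within a single dictionary. A minimal sketch follows, assuming each entry is available as a (spelling, transcription) pair; the function name `homonym_sets` and the toy transcriptions are illustrative and are not the report's data or program.

```python
from collections import defaultdict

def homonym_sets(entries):
    """Group spellings that share one phonetic transcription in one dictionary.

    entries: iterable of (spelling, transcription) pairs.
    Returns {transcription: sorted spellings} for groups of two or more.
    """
    by_sound = defaultdict(set)
    for spelling, transcription in entries:
        by_sound[transcription].add(spelling)
    return {t: sorted(ws) for t, ws in by_sound.items() if len(ws) > 1}

# Toy transcriptions, made up for illustration only.
entries = [
    ("pail", "p e l"),
    ("pale", "p e l"),
    ("break", "b r e k"),
    ("brake", "b r e k"),
    ("pal", "p ae l"),
]
sets_found = homonym_sets(entries)
print(sets_found)
```

A word with several listed pronunciations would simply contribute one pair per pronunciation, which is how one spelling can belong to more than one set.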


*Supported by the Lockheed Independent Research Program.

**According to the 2nd edition of Fowler's A Dictionary of Modern English Usage,
Robert Bridges published an essay on homophones in 1919, as Tract II of the Society
for Pure English, in which he compiled lists of words that are pronounced alike but
have "different origin and signification." His lists contained 835 entries comprising
1,775 words (not limited to one-syllable words, and not including words that were
originally the same but have acquired different meanings), which led him to the
propositions that homophones are a nuisance and that English is exceptionally burdened with
them. He proposed also, however, that homophones are self-destructive and tend to
become obsolete, a proposition which may be questioned in the light of our recent
compilations.

References 4 through 8, respectively, will be referred to by the following abbreviations:

• MW3
• KK
• ACD
• JON
• SOX

The homonym sets were derived separately for each dictionary, so that differences
in the phonetic symbology of the dictionaries did not cause any problems. For each
compilation, all 5,757 elementary words listed were considered, even though not every
word appeared in all five dictionaries. Before the homonym sets were compiled,
each pronunciation of each word was identified by dictionary source and also by class
of dialect when applicable. For words missing from one or more of the dictionaries,
the missing phonetic transcriptions were generated by algorithm and marked with an
indicator so they could be readily identified as special cases.**


*SOX and JON represent speech patterns in Great Britain; sometimes variant British
pronunciations are given in JON. The other three dictionaries represent speech
patterns in the United States. ACD represents the midwestern speech pattern, with
occasional variant pronunciations given. KK presents separately the pronunciation
of words in eastern, southern, and midwestern "dialects." MW3 presents speech in
regions considered by KK and also in regions of New York City (e.g., Brooklyn and
Bronx).

**Instead of transcribing the phonetics from the dictionaries, a highly accurate
algorithm (better than 93 percent accurate) was devised for automatically generating the
phonetic form for each dictionary from the graphic form. The generated forms were
then checked against the dictionaries, and errors were corrected. Corrected words
were marked with a D indicator. The phonetic representations of words missing from
a given dictionary could not be directly checked, however, and were marked with (1)
an N indicator if the algorithm had functioned correctly in deriving the SOX phonetics
of that word or (2) an M indicator if the algorithm had given incorrect results on this
dictionary, in which case the probable error had been corrected. Thus, the M
indicator is almost equivalent to an N + D marker. The algorithms for generating
phonetic transcriptions are described in two not-yet-published manuscripts: "Acoustic
Phonetic Transcription of Written English," by B. V. Bhimani and J. L. Dolby, and
"The Operational Relation Between the Phonetic Forms of Elementary Words," by
B. V. Bhimani and R. P. Mitchell.
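
The marking scheme in the second footnote can be summarized as a small decision rule. The sketch below is one reading of that footnote, not the authors' program: the function name and arguments are hypothetical, and the "S" marker for an unmodified, dictionary-checked form is an assumption inferred from codes such as 101SK in the sample printouts, since the text does not define it.

```python
def indicator(in_dictionary, algorithm_correct):
    """Return the marker letter for one generated phonetic form.

    in_dictionary: the word is listed in the dictionary being processed.
    algorithm_correct: for a listed word, the generated form matched the
        dictionary; for a missing word, the algorithm was correct on the
        SOX form of that word (the only check the footnote describes).
    """
    if in_dictionary:
        # "S" is an assumed marker for forms needing no correction.
        return "S" if algorithm_correct else "D"
    # Missing words cannot be checked directly against this dictionary.
    return "N" if algorithm_correct else "M"

for listed in (True, False):
    for correct in (True, False):
        print(listed, correct, indicator(listed, correct))
```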

The statistics of the homonym compilation in each of the five dictionaries are
given in Table 6-1 and graphically in Fig. 6-1. (Note the 10 to 1 change in scale in
Fig. 6-1 between sets of 3 and sets of 4.) Figure 6-2 is a sample page from one of
the homonym printouts. The first three columns give the graphic form split into
consonant and vowel strings; the next three columns give the code for the phonetic
representation; the seventh column indicates the set of algorithmic rules by which
the phonetic representation was derived; and the final column indicates the source
of the phonetic data used. A blank line separates the homonym sets.

Table 6-1


NUMBER OF HOMONYM SETS IN FIVE DICTIONARIES


Dictionary            MW3     KK    ACD    JON    SOX

No.  2 Word Sets     1889   1402    717    727    661
No.  3 Word Sets      380    268    133    142    117
No.  4 Word Sets       99     55     33     31     27
No.  5 Word Sets       18     11      4      8      3
No.  6 Word Sets        9      5      2      0      0
No.  7 Word Sets        1      1      0      0      0
No.  8 Word Sets        1      0      1      1      0
No.  9 Word Sets        0      1      0      0      0
No. 10 Word Sets        1      0      0      0      0

Surprisingly, both the number of sets and the total number of words involved in
homonym sets differ considerably from dictionary to dictionary, and a word which
may be in a homonym set according to the phonetic representation in one dictionary
may not have a homonym according to another dictionary. Accordingly, a homonym
comparison table of the 5,757 words considered was prepared by a computer program,
showing in which dictionaries each word occurs in a homonym set, and how many and
which phonetic representations were involved. Table 6-2 summarizes the phonetic

Table 6-2


PHONETIC REPRESENTATION CODES

Code     Interpretation                      Dictionary

JON 1    1st pronunciation                   JON
JON 2    2nd pronunciation                   JON
ACD 1    1st pronunciation                   ACD
ACD 2    2nd pronunciation                   ACD
101SK    Midwestern pronunciation            KK
102SK    First Variant pronunciation         KK
103SK    East and South pronunciation        KK
104SK    East pronunciation                  KK
105SK    Second Variant pronunciation        KK
106SK    Third Variant pronunciation         KK
107SK    Fourth Variant pronunciation        KK
101SW    Midwestern pronunciation            MW3
102SW    First Variant pronunciation         MW3
103SW    Boston R Dropper pronunciation      MW3
104SW    Brooklyn R Dropper pronunciation    MW3
105SW    L Dropper pronunciation             MW3
106SW    Second Variant pronunciation        MW3
107SW    Third Variant pronunciation         MW3
108SW    Fourth Variant pronunciation        MW3
109SW    Fifth Variant pronunciation         MW3
20XSW    Consonant variant pronunciation
         on the 10X pronunciation of MW3
20XSK    Consonant variant pronunciation
         on the 10X pronunciation of KK

codes used. Note that the dictionaries do not all give the same number of phonetic
variations, nor are their phonetic classes always the same. SOX usually gives only one
pronunciation, and therefore there are no SOX entries in Table 6-2. Figure 6-3, a
sample page from the homonym comparison table, indicates, for example, that the
word fon is involved in a homonym set only according to the MW3 pronunciation. Yet
the word fort is involved in six MW3 homonym sets, four KK sets, one JON set, one
ACD set, and no SOX set. (In general, SOX has the fewest homonyms, indicating
perhaps that the SOX phonetic transcription system is finer.)

The total number of words in the homonym comparison table is 2,966, showing
that 2,966 of 5,757 words are in a homonym set according to at least one dictionary.
Thus, over 50 percent of the elementary words are ambiguous in their spoken form.
The homonym comparison table points up two significant findings: the apparent
disparity among dictionaries and the large percentage of elementary words distinguished
in the graphic but not the spoken form (as recorded by the dictionaries).
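
A homonym comparison table of this kind can be sketched as a mapping from each word to the dictionaries in which it falls in some homonym set. The code below is an illustration under assumed inputs (a list of homonym sets per dictionary); the function name and toy data are hypothetical, and the report's program also recorded which phonetic representations were involved, which is omitted here.

```python
def comparison_table(sets_by_dictionary):
    """Map each word to the sorted list of dictionaries giving it a homonym.

    sets_by_dictionary: {dictionary name: list of homonym sets (lists of words)}.
    """
    table = {}
    for name, sets in sets_by_dictionary.items():
        for members in sets:
            for word in members:
                table.setdefault(word, set()).add(name)
    return {word: sorted(names) for word, names in table.items()}

# Toy input for illustration only; not the report's data.
toy = {
    "MW3": [["break", "brake"], ["pail", "pale"]],
    "KK": [["break", "brake"]],
    "SOX": [],
}
table = comparison_table(toy)
print(table)
print(len(table), "words are in a homonym set in at least one dictionary")
```

The number of keys in the table plays the role of the 2,966 figure quoted above: it counts words that are in a homonym set according to at least one source.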

Before exploring the possible reasons for the disparity in homonym sets according
to the dictionary from which derived, some possibilities can be eliminated. First,
since all these dictionaries were published at approximately the same time, and since
it is generally recognized that their contents are periodically updated, historic vowel
changes are not expected to cause discrepancies. Also, vowels which are consistently
pronounced one way according to one dictionary and another way (but always the same
other way) according to a second dictionary will affect the homonym compilation very
little. For example, break and brake are homonyms whether the vowel is given a
British pronunciation as indicated by "b r e i k" in JON or an American pronunciation


Fig. 6-3 Sample Page of Homonym Comparison Table


as indicated by "b r e k" in KK. The list below gives the phonetic symbols for this
sound from each of the five dictionaries and the corresponding code used for machine
purposes. (JON and KK use the International Phonetic Alphabet.)


Dictionary    Phonetic symbols    Machine code

SOX           b r c* k            B R 121419 K
JON           b r e i k           B R E 1 K
ACD           b r a k             B R A 4 K
KK            b r e k             B R E K
MW3           b r a k             B R A 1 K

Thus, consistent changes from dialect to dialect will not cause significant discrepancies
in homonyms.

What then will cause discrepancies from dictionary to dictionary? When several
dialects are considered together in the compilation of homonyms, as in KK and MW3,
extra homonym sets or larger sets can be produced across the dialects. For instance,
two words which are not homonyms in either the southern or eastern dialects may
become homonyms when the southern pronunciation of one is compared with the eastern
pronunciation of the other. By removing the dialect pronunciations from the homonym
sets, two objectives are met:

• The ambiguity-producing effects of dialects are shown.
• Homonym disparities between ACD and KK or MW3 which result from the
inclusion of dialects are removed.

In removing dialects, some difficulty is encountered in identifying true dialectal
pronunciations. The 103SK, 104SK, 20XSK (where X is any number), 103SW, 104SW,
105SW, 30XSW, and 20XSW pronunciations (Table 6-2) were considered to be true
dialects by the dictionaries in which presented and were, therefore, removed by
computer program from the homonym sets. The homonym comparison program was run
again on the homonyms after the removal of the dialectal pronunciations to produce
another homonym comparison table of the same form as shown in Fig. 6-3. The results
show the expected reduction in the number of sets containing a given word and in the
number of words that appear in homonym sets, but these reductions are not as large
as was expected.
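
The dialect-removal step can be pictured as a filter on pronunciation codes followed by a check that a set still holds at least two different words. This is a hedged sketch, not the original program: it assumes a homonym set is stored as (word, code) pairs, takes the code families exactly as listed in the text with X read as any single digit, and uses hypothetical names throughout.

```python
import re

# Code families the text names as true dialects; X stands for any digit.
DIALECT_CODES = ["103SK", "104SK", "20XSK", "103SW", "104SW", "105SW", "30XSW", "20XSW"]
_DIALECT_RE = re.compile("|".join(c.replace("X", r"\d") for c in DIALECT_CODES))

def is_dialect(code):
    """True if a pronunciation code belongs to one of the dialect families."""
    return _DIALECT_RE.fullmatch(code) is not None

def remove_dialects(homonym_set):
    """Drop dialect pronunciations from one set of (word, code) pairs.

    The set survives only if two or more distinct words remain.
    """
    kept = [(word, code) for word, code in homonym_set if not is_dialect(code)]
    return kept if len({word for word, _ in kept}) > 1 else []

# Placeholder words; the codes follow the pattern of Table 6-2.
print(remove_dialects([("word_a", "101SK"), ("word_b", "103SK")]))
print(remove_dialects([("word_a", "101SK"), ("word_b", "102SK"), ("word_c", "104SK")]))
```

The first call shows a set created only by a dialect pronunciation disappearing altogether; the second shows a set that merely shrinks.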

To show the relationships among the five dictionaries from the point of view of the
involvement of the words in homonym sets, some statistics of homonym membership
were compiled and are given in Table 6-3. Since the statistics were compiled from
the homonym comparison tables, which were compiled before and after the removal of
the dialects, the effect of the dialect removal is shown. Note that with the dialects
removed the proportion of elementary words which are in homonym sets is reduced by
only about 5 percentage points, from 52 to about 47 percent. Note also that the
relationships among the various sets named in Table 6-3 do not change significantly.
In particular, the ratio between the words forming a homonym in all dictionaries and
the words forming a homonym in any dictionary changes only from 0.5074 to 0.5467 when
dialects are removed. Thus, the dialects are not the main reason for the large number
of homonyms, nor are they the major cause of discrepancies among the dictionaries.
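
As a rough check on these figures, the quoted ratios can be combined with the "at least one dictionary" totals of Table 6-3 (2966 with dialects, 2714 without) to recover the implied number of words that form a homonym in all five dictionaries. The variable names are introduced here for illustration only.

```python
# Totals from Table 6-3: words forming a homonym in at least one dictionary.
with_dialects = 2966
without_dialects = 2714

# Ratios quoted in the text: (all dictionaries) / (any dictionary).
ratio_with = 0.5074
ratio_without = 0.5467

# Implied counts of words forming a homonym in all dictionaries.
implied_all_with = round(ratio_with * with_dialects)
implied_all_without = round(ratio_without * without_dialects)

# Relative shrinkage of the "any dictionary" set when dialects are removed.
relative_drop = (with_dialects - without_dialects) / with_dialects
```

Under these inputs the implied counts are 1505 and 1484, and the "any dictionary" set shrinks by roughly 8.5 percent of its own size.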

It is also revealing to consider the actual occurrences of ambiguity introduced by
the dialects, and because they are not numerous we have prepared tables which give
them all. In Table 6-4, Part A shows all new sets introduced by the dialect
pronunciations of KK; Part B shows all words or sets added to nondialectal homonym
sets by a dialect pronunciation of KK. The starred items were not removed by the
program but seemed to the authors to be dialect forms and were removed later.
Table 6-5 shows all the dialectal pronunciations removed from MW3, but here we have
divided them into nine significant categories.

Table 6-3

STATISTICAL SUMMARY OF WORDS INVOLVED IN HOMONYM SETS,
SHOWING EFFECT OF DIALECT REMOVAL

                                                        No. of Words in Set
Set Description                                    With Dialects   Without Dialects

Total Set: Words forming a homonym
  in at least one dictionary                           2966             2714
Words forming a homonym in one dictionary               746              636
Words forming a homonym in two dictionaries             236              214
Words forming a homonym in three dictionaries       [illegible]          184
Words forming a homonym in four dictionaries        [illegible]      [illegible]
Words forming a homonym in all dictionaries            1505             1484
Words forming a homonym in SOX                         1764             1764
Words forming a homonym in ACD                         1937             1937
Words forming a homonym in JON                         2039             2039
Words forming a homonym in MW3                         2600             2297
Words forming a homonym in KK                          2140             2096

Table 6-4

WORDS INVOLVED IN HOMONYM SETS IN KK
BECAUSE OF DIALECTAL PRONUNCIATIONS

Part A

Graphic    Phonetic    Dictionary Code

MUZZ 

MA(iZ 

HUNK 

MUS 

20 1  NK 

DAZE 

DEZ 

1  o  |  SK 

DASE 

.'0|  NK 

0  KEITH 

Cl  RE  PI 

201  NK 

CRAKT1I 

101  ilk 

NAIS 

NEZ 

201  NK 

MAZE 

1  o  i  \  K 

CLEAR 

KLE 1  E2(R 

|o.,SK 

CLARE 

lolhK 

HEAR 

RE1E2(R 

1 05SK 

RAKE 

101SK 

MY 

ME  2 

1 OU  or  101SK 

or 

MAC 

Ml 

1 00  or  100SK 

BROOSE 

IU1Z 

202NK 

BRUISE 

102DK 

CIIKSE 

T$I  1 Z 

20  INK 

CHEESE 

101  NK 

CROZE 

KROZ 

101  NK 

CHOSE 

20  INK 

SHORE 

$0E2(R 

1 04  or  lOUSK 

SURE 

105  or  102SK 

LAPSE 

HOIZ 

201  NK 

HAWSE 

10  INK 

BROOSE 

BRU1Z 

201  NK 

BRUISE 

101  DK 

COUTH 

KU1P1 

10  INK 

COOT  If 

201 MK 

J  EER 

I)Z  1 E 1  E2(R 

105SK 

•OEKR 

I)Z  1  E 1  E2(R 

105SK 

JEER 

DZ  l  E  l  K2(K 

105SK 

Table 6-4 (Cont.)

Graphic    Phonetic    Dictionary Code

%  EE  Alt 

EE  1  K2(It 

105SK 

EKE  It 

EKlK2(It 

105SK 

„  ELK  Alt 

ELE1  E2(It 

105SK 

ELK  Ell 

ELE 1  K2(It 

105SK 

UK  Alt 

1IK1K2(R 

107SK 

•IIKKIt 

IS  El  E2(R 

107SK 

HKItK 

11E1  E2(lt 

108SK 

.  LEAR 

LEI  E2(R 

105SK 

LEER 

LEI  E2(R 

105SK 

TEAK 

TE1  E2(R 

10f.SK 

•TKER 

TE1E2(R 

105SK 

TIE  It 

TK1K2(R 

105SK 

.WEIR 

WEI  E2(R 

105SK 

WERE 

WEI  E2(R 

105SK 

* TROTH 

TRA3P1 

105SK 

TROUGH 

10GDK 

.BUM 

BAGM 

101SK 

BOMB

102SK

Part B

NEEZE 

NIIZ 

10  INK 

•WERE 

WAl  E2(R 

107SK 

•OUR 

A2U(R 

105SK 

♦EAR 

ElE2(R 

10”»SK 

.BIER 

BEIE2(R 

105SK 

BEER 

BEl  E2(R 

105SK 

•BLEAR 

BLEIE2(R 

105SK 

.DEER 

DEIE2(R 

105SK 

DEAR 

DEI  E2(R 

ior>sK 

’KIEIt 

KEIE2(R 

105SK 

•MEER 

ME  1  E2(R 

105SK 

•PEER 

PE  1  K2(R 

105SK

Table 6-4 (Cont.)

Graphic    Phonetic    Dictionary Code


,SPKAR 

SPHl  K2(R 

i  o:,sk 

SPEEIiE 

SPK1  K2(R 

105SK 

♦CIIEEIt 

T$E1K2(R 

1  OliSK 

♦AND 

K2N 

1 OGSK 

♦WEAR 

WIE2(R 

106SK 

♦POOR 

POE2(R 

1 05SK 

♦PRYSE 

PRAIZ 

201SK 

♦BLOUSE 

I1LAUZ 

201SK 

♦CLOUGH 

KLA2F 

11WDK 

♦DON 

DA3N 

HWSK 

♦WOT 

WAliT 

103SK 

♦SHARE 

$El  E2(R 

103SK 

♦CERE 

SEl  K2(R 

104SK 

♦ERR 

E3(R 

103SK 

♦YA1R 

JE1E2(R 

104NK

Table 6-5

WORDS INVOLVED IN HOMONYM SETS IN MW3 BECAUSE
OF DIALECTAL PRONUNCIATIONS

Set A


PUT 

DROWTE 

SATE 

SNOT 

WET 

CLEAT 

LIT 

QUOTE 

PUD 

DRAD 

SADE 

SNOD 

WED 

CLEAI) 

LID 

QUOD 

NEWT 

CLOUT 

SLATE 

TROT 

CHUT 

LEASE 

MITT 

TOTE 

NUDE 

CLOUD 

SLADE 

TROD 

CHAD 

LEESE 

MOD 

TOAD 

FAT 

CROUT 

TRAIT 

BET 

GLUT 

PLEAT 

WRIT 

BROI  0,1  IT 

FAD 

CROWD 

TRADE 

BED 

BLOOD 

PLEAD 

ROD 

BROAD 

GAT 

LOUT 

BRAT 

BKUTTE 

HUT 

SPETE 

SKIT 

DRAUGHT 

GAD 

LOUD 

BROD 

BUI) 

HUD 

SPEED 

SKID 

BRAUD 

MAT 

PLATE 

DOT 

FET 

CRUT 

TWEET 

FRIGHT 

SQUAT 

JAI) 

BLADE 

DOD 

FEI) 

CRUD 

'I  WEED 

FRIED 

SQUAD 

CAT 

DATE 

CLOT 

GET 

MUTT 

WEET 

KRAIT 

SHAT 

CAD 

DADE 

CLOD 

GED 

MUD 

WEED 

CRIED 

SWAD 

GNAT 

DASE 

POT 

KET 

SHUT 

WHIT 

PIGIIT 

WATT 

NAD 

DAZE 

POD 

KED 

SHOULD 

WIIID 

PIED 

WAD 

PAT 

SOOT 

PLOT 

PET 

SCUT 

BRIT 

SNITE 

FEUTE 

PAD 

SUD 

PLOD 

PED 

SCUD 

BRID 

SNIDE 

FOOD 

PLAT 

CADE 

SOT 

STET 

STUT 

GRIT 

TIGHT 

HOOT 

PLAID 

CATE 

SOD 

STEAD 

STUD 

GROD 

TIDE 

HOOD 

RAT 

PATE 

SKOT 

THREAT 

BLEAT 

KIT 

TRITE 

MOOT 

RAD 

PAID 

SHOD 

THREAD 

BLEED 

KID 

TRIED 

MOOD 

WAT 

RATE 

SQU  AT 

TRET 

CHESE 

QUIT 

CROSE 

FOOT 

WAS 

RAID 

SQUAD 

TREAD 

CHEESE 

QUID 

CROZE 

FOOD 

Table 6-5 (Cont.)

Set B


SI  lit  AG 

CARVE 

SUPINE 

MON 

SWAG 

CALVE 

sw:ne 

MUM 

ClIEItT 

CLART 

SHRIVE 

MONT 

CHAT 

CLAUT 

SW1VE 

MENT 

IIAULSE 

MARL 

SOURCE 

PURSE 

HOUSE 

MALL 

PSOAS 

PUS 

11  AULT 

PARSE 

FAULT 

SIIONCi 

HOUT 

PASS 

FOUGHT 

SHUN 

GOLF 

SCARP 

GAULT 

THIS 

GOFF 

SCAUP 

GHAUT 

THUS 

ARSE 

SMARM 

SURE 

AL 

ASS 

SMALM 

SHIRR 

ILE 

BARGII 

SPAR 

SPEARE 

DEE 

BAFF 

SPA 

SPHERE 

DIT 

BARM 

TAR 

SAVLE 

LA 

BALM 

TA 

SERVE 

LAW 

BARSE 

HEARSE 

UGH 

DRAUGHT 

BASS 

L'JSS 

HER 

DROUTH 

BARTH 

SIR 

FUM 

THE  E 

BATH 

SO 

FROM 

THY 

CHAR 

SEER 

DUD 

TIE 

CllA 

SEA 

DID 

TAIL EE 

DART 

SHRIFT 

GUN 

K1NE 

ix)  r 

SWIFT 

GON 

KIN 

GAR 

SHRILL 

HUFF 

FETCH 

GAW 

SHILL 

HAVE 

REACH 

HAl'GH 

SHRINK 

iiU/.Z 

ILL 

HARK 

SW1NK 

HAS 

ILE 

Ait 

jar

Table 6-5 (Cont.)

Set C


A'r 

WADE 

CUT 

DID 

CODE 

HAT 

KOI) 

NOU  LI) 

FID 

LOTE 

11  HAT 

GOD 

PUD 

GED 

NODE 

DRAD 

QUAT 

HUT 

KIT 

TO  IX) 

LAI) 

NOD 

Sll>OT 

C1D 

SI  It)  AT 

MAD 

HOD 

ESE 

BIDE 

boot 

SCAD 

SWAD 

FEED 

BRIDE 

BKOU1) 

BLOUSE 

TOD 

GLEET 

GUIDE 

LEUI) 

FADE 

WAD 

GREED 

HIDE 

HARD 

CADE 

BAWD 

NEAT 

SIDE 

CARD 

G  HADE 

KET 

REIT 

SICE 

SAB) 

HADE 

SAID 

CEASE 

SLIDE 

BIDE 

LATE 

IDE 

SWEDE 

WIDE 

BRIDE 

MATE 

BUD 

IT 

OAT 

GUIDE 

SPADE 

EUD 

BID 

BODE 

BIDE 

Set D

HALT 

1IARHE 

Ml 

•CAR) 

•MORE 

AH 

ILVHM 

( 

•HOLD 

•HOW 

DAKK 

CARE 

AYAH 

•HAULM 

•YOUR 

CLAD 

MAH 

SOY 

•HORSE

Table 6-5 (Cont.)

Set E

Set H

AIT 

WRITE 

SWATH 

EIGHT 

RIGHT 

SWATHE 

EYGIIT 

GOAT 

FART 

AUGHT 

GOTE 

FAD 

GUTTE 

MODE 

SPOUT 

GOT 

GIIAIT 

MOD 

SPALD 

BOUGHT 

SWEAT 

GPETT 

BOTT 

SHRED 

BRET 

ROOD 

GIRT 

DEBT 

RUDE 

GJRD 

PETTE 

Set F

CURT 

LET 

CURD 

LE IT 

BAR 

BARR 

WORT 

GUT 

GUTTE 

PAR 

WORD 

PARR 

GIRT 

BEAT 

BEET 

EARN 

BIRD 

URN 

CURT 

HEAT 

CURD 

I1ETE 

Set  G 

CEAT 

LEET 

CHAD 

SURD 

BEAT 

DOWD 

TIT 

METE 

MEET 

BIRT 

TEAT 

MEAT 

HERD 

SORT 

SWORD 

CETfi 

SEAT 

FORD 

Set I

NIGHT 

CORT 

GHAUT 

KNIGHT 

WARD 

GALT

Set A. New homonym sets in which a pronunciation of type 20X is involved. These
       reflect confusion between T and D or S and Z sounds, which may not be
       strictly a dialectal phenomenon.

Set B. New homonym sets in which a pronunciation of the type 20X is not involved.

Set C. Words in which a pronunciation of the type 20X adds one to the number of
       homonyms in a nondialectal homonym set.

Set D. Same as C, except a non-20X dialectal pronunciation is responsible for an
       extra member of a homonym set. (Starred items were added by hand as in
       Table 6-4.)

Set E. New homonym sets caused by a pronunciation of the type 20X, where each
       of these sets has the same pronunciation as a nondialectal homonym set.
       Thus, these words add more than one member to a nondialectal set.

Set F. Same as E, except a non-20X dialectal pronunciation is responsible for
       the extra members of the homonym sets.

Set G. Words in which a dialectal pronunciation causes confusion with words
       already in sets B or D. Thus, a dialectal pronunciation of chert causes
       the homonym set chert, chat. A dialectal pronunciation of chad adds to
       the set, making it chert, chat, chad.

Set H. New homonym sets in which two dialectal variations combine to form a
       homonym group.

Set I. New homonym sets in which two dialectal variations combine to form a
       homonym group, where each of these groups has the same pronunciation
       as a nondialectal homonym set.

To summarize our results, it has been shown, using phonetic representations
from five dictionaries, that approximately half of the elementary words of English are
ambiguous according to at least one dictionary, and that this figure is not significantly
changed by removal of predefined dialectal pronunciations. The words whose dialectal
pronunciations have affected the homonym sets have been listed. Discrepancies in
homonym data among the five dictionaries have been made apparent. It has been
indicated that neither historic vowel changes nor consistent vowel changes can be
considered to be a major cause of these discrepancies. Also, it has been shown that
the dictionary-defined dialectal vowel variations account for only a small proportion
of these discrepancies.


7. ACOUSTIC PHONETIC TRANSCRIPTION OF WRITTEN ENGLISH*

B. V. Bhimani and J. L. Dolby


INTRODUCTION 

The current spelling of an English word is the symbolization of a traditionally
preserved form of its pronunciation, even though it may seem to be an imperfect
representation. We shall investigate the accuracy of this representation by a detailed
examination of all of the one-syllable words given in the Shorter Oxford Dictionary
(Ref. 1).

In particular, we shall show that it is possible to construct a computable algorithm
that provides the correct phonetic representation (according to the source dictionary)
for 93 percent of these words, given only the written form of the word.

The  essential  feature  of  this  algorithm  is  that  it  makes  use  of  what  we  shall  here 
call  the  "marking  system  of  written  English."  In  some  writing  systems  explicit 
markers  are  used  to  indicate  vowel  duration  and  stress.  Sanskrit,  for  instance,  uses 
a  phonetic  alphabet  and  numerals  for  indication  of  vowel  duration  along  with  markers 
for  stress.  Thus  the  English  word  ALMS  would  be  represented  as  3-T| 3  • 

In French, diacritic markers are used for similar effects (e.g., HÔPITAL).

In English, however, the orthography is limited to the 26 letters of the alphabet
and the marking system is more subtle. One well-known feature of the English marking
system is the use of the final E (represented here as #) as a marker operating on the
preceding vowel string. Proper use of the markers is necessary in written English
for phonetic transcription of its words.

The initial restriction of this study to the one-syllable words of English was made
to enable us to study the marker system for precise transcription of vowel articulation
and duration and to resolve certain consonantal ambiguities without analyzing the
added complications introduced by the stress-marking system necessary for polysyllabic
words. The stress-marker system of written English is intimately connected with the
systems for carrying important grammatical signals (Refs. 2, 3). As has been noted
elsewhere (Ref. 4), the one-syllable words (except for the small but important set of
structure words) are generally grammatically homogeneous.

*This work was supported by the Lockheed Independent Research Program.


THE  SCHEMATIC  STRUCTURE  OF  THE  ALGORITHM 


The algorithm considered here has been programmed on a digital computer. This
device uses a limited number of symbols. As a result, it was necessary to replace
the phonetic coding system of the Shorter Oxford Dictionary (SOX) with a set of
alphanumeric codes acceptable to the machine. These codes are given in Fig. 7-1. It
will be noted that the transformation from the dictionary codes to the machine codes
is one-for-one, so that no essential information is lost by this step in the procedure.
Moreover, the alphanumeric codes were chosen to ensure that all possible codes given
by the dictionary would be representable. Only 38 of the 150 codes actually occurred
in the one-syllable words.

The algorithm itself is shown in schematic form in Fig. 7-2. The first step
consists of a simple classification of the written symbols into the system of graphemic
vowels, consonants, and markers given in Fig. 7-2. In this system, a final E (#) is
classed as a marker; all other occurrences of E, together with all occurrences of A, I,
O, U, and Y, are classed as (graphemic) vowels. All remaining characters are classed
as (graphemic) consonants.
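
The first step can be sketched as a short Python function. This is a minimal illustration and not the program described in the report: `classify` and `is_final_marker` are names introduced here, and the sketch assumes that a final E counts as a marker only when it follows a consonant and an earlier vowel is present (so the E of THE, or the second E of BLEE, stays a vowel).

```python
VOWELS = set("AEIOUY")


def is_final_marker(word, i):
    """Assumed test: a final E after a consonant, with some earlier vowel in the word."""
    return (
        word[i] == "E"
        and i == len(word) - 1
        and i > 0
        and word[i - 1] not in VOWELS
        and any(ch in VOWELS for ch in word[:i])
    )


def classify(word):
    """Label each letter as 'vowel', 'consonant', or 'marker'."""
    word = word.upper()
    labels = []
    for i, ch in enumerate(word):
        if is_final_marker(word, i):
            labels.append((ch, "marker"))
        elif ch in VOWELS:
            labels.append((ch, "vowel"))
        else:
            labels.append((ch, "consonant"))
    return labels
```

For NICE this yields consonant, vowel, consonant, marker, which matches the first line of the trace in Fig. 7-3.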

Step two consists of an analysis in context to resolve consonantal ambiguities such
as those which occur with the graphemic C and G. In the third step, the letter strings
are processed through the rules of euphonic combination, reported previously, to
change the letter strings into consonant strings, vowel strings, and markers. Step
four consists of the necessary rules to resolve the vowel ambiguities, which are the
main portion of the problem. In step five the vowel markers are used to transform
the graphic symbols into the phonetic symbols as given by SOX.

[Chart of code markings for each vowel symbol; column headings: Symbol, No Marking,
Break, Inv. Period, Super Letter]

a. Coding of Vowel Sounds and Pertinent Markings for Pronunciations
of English Words

Fig. 7-1 Alphanumeric Coding for the Phonetics of the Shorter Oxford Dictionary

[Flowchart from KERNEL WORD to Phonetics as in Shorter Oxford Dictionary]

Fig. 7-2 Algorithm in Schematic Form

An illustration of the operation of the program is given in Fig. 7-3 for the word
NICE. Figure 7-4 illustrates the processing of the word SMUDGE. A typical page of
computer output for the phonetics of SOX is given in Fig. 7-5. The first column is
the orthographic form of the word, the second column is the compiled phonetic
representation, and the third column specifies the rule used for the resolution of the
vowel ambiguities. The resulting phonetic codes were checked individually against the
source dictionary, and correction cards (identifiable by the English words following
the asterisk) were added to the output deck where errors appeared (see, for instance,
BLAE in Fig. 7-5).

A total of 407 errors were detected in the 5,757 one-syllable words given in the
source dictionary. Some of these errors were a result of errors in the syllable
counting routine used to obtain the one-syllable words from a magnetic tape listing of
SOX (Ref. 6). The word BLASE is the one example of this sort shown in Fig. 7-5. Many
of the remaining errors (such as BLAE) occurred in obscure words and words of limited
current interest. To obtain a quick check on the expected accuracy of the program on
words of greater usage, a random sample of 50 words was chosen from the subset of
the one-syllable words having a standard meaning in both the source dictionary and
Webster's Third International Dictionary (Ref. 7). Only one error was found in this
sample (the vowel of CIIASS was incorrectly equated to the vowel of BRASS).

In the remainder of the paper we discuss the derivation of the rules necessary to
resolve the various ambiguities of written English.

NICE / KERNEL WORD

    N/Consonant + I/Vowel + C/Consonant + Marker E
Interpretation of Consonant Markers:
    N/Consonant + I/Vowel + S/Consonant + Marker E
Rules of Euphonic Combination:
    N/Consonant + I/Vowel + Marker E + S/Consonant
Interpretation of Vowel Markers:
    NE2IS (phonetics as in Shorter Oxford Dictionary: nəis)

Fig. 7-3 Operation of the Program for the Word NICE

SMUDGE / KERNEL WORD

    SM/Consonant String + U/Vowel + DG/Consonant String + Marker E
    SM/Consonant String + U/Vowel + D DZ1/Consonant String + Marker E
    SM/Consonant String + U/Vowel + Vowel Marker D + Marker E + DZ1/Consonant
    SMA3DZ1 (phonetics as in Shorter Oxford Dictionary: smɒdʒ)

Fig. 7-4 Processing of the Word SMUDGE

[Computer listing in three columns (ORTHOGRAPHIC FORM, PHONETIC RENDERING,
OPERATING RULES) for entries beginning with BL, including correction cards marked
by an asterisk, e.g., *VOWEL ERRORS for BLAE and *SINGULARITIES for BLANCH]

Fig. 7-5 Typical Computer Output for Phonetics of the Shorter Oxford Dictionary

THE CONSONANT STRING MAPPING


As noted in Reference 4, it is always possible to represent the graphemic form
of a one-syllable word in the form CVC, where C represents a string of consonants
and V represents a string of vowels. The only conventions necessary to accomplish
this are the conventions whereby the final # is treated as a marker and words beginning
or ending with a vowel are augmented with the "blank consonant" ∅. The blank
consonant is also used in the phonetic form of the word. The relation between the
graphemic and phonetic forms of the word can then be studied as a composite of the
three mappings that carry the initial consonant string, the vowel string, and the final
consonant string from the graphemic form to the phonetic form. For instance, if we
consider the word STRAIGHT we obtain the following triple mapping:

                            Graphemic    Phonetic

Initial Consonant String    STR          STR
Vowel String                AI           E14I9
Final Consonant String      GHT          (GH)/MARKER  T
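
The CVC decomposition can be sketched with a regular expression. This is an illustrative sketch rather than the report's program: `split_cvc` and `BLANK` are names introduced here, and the rule for detecting the final-E marker (a final E after a consonant, with an earlier vowel present) is an assumption.

```python
import re

BLANK = "∅"  # stands in for the "blank consonant"
CVC = re.compile(r"^([^AEIOUY]*)([AEIOUY]+)([^AEIOUY]*)$")


def split_cvc(word):
    """Return (initial consonants, vowel string, final consonants, has_final_marker)."""
    word = word.upper()
    # Assumed marker test: final E, preceded by a consonant, with an earlier vowel.
    marker = (
        len(word) >= 2
        and word.endswith("E")
        and word[-2] not in "AEIOUY"
        and any(ch in "AEIOUY" for ch in word[:-1])
    )
    body = word[:-1] if marker else word
    match = CVC.match(body)
    if match is None:
        raise ValueError(f"{word!r} does not fit the one-syllable CVC pattern")
    initial, vowels, final = match.groups()
    return (initial or BLANK, vowels, final or BLANK, marker)
```

For STRAIGHT the function returns the strings STR, AI, and GHT with no final marker; for NICE it returns N, I, and C with the marker flag set. A word such as PIANO, which has two separated vowel strings, raises `ValueError`.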


The complexity of this mapping then becomes the central issue. The writing system
of English might be considered purely phonetic if one could find that every written
form with the initial consonant string STR would also have STR as its initial phonetic
consonant string. An even more stringent requirement for an "ideal" writing system
would require that each symbol (rather than each string) map into a unique phonetic
symbol. It would also be convenient if this map were invertible, that is, if graphemic
STR and only STR mapped into phonetic STR. Numerous examples exist to show that
this is not the case (e.g., graphemic F and graphemic PH both map into phonetic F).
However desirable such an ideal system might be from the viewpoint of the linguist,
it is clear that it is not in keeping with the written English form. The use of the
subtle marking system nevertheless accomplishes these objectives, as discussed next.

When we assign a specific phonetic value for a specific graphic symbol, only two
consonants lead to ambiguous situations, namely the graphic consonants C and G.
The former maps into either phonetic K or phonetic S, and the latter maps into either
phonetic G or DZ1. There is, however, a subtle consonant marking system that
readily resolves this ambiguity with a very high degree of accuracy, and it is tabulated
next.

                                  Graphic    Phonetic

If C is followed by A, O, U       C          K
Otherwise                         C          S
If G is followed by E             G          DZ1
Otherwise                         G          G
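
A literal transcription of this table into Python might look as follows. The function name and its two-argument form are assumptions made for illustration; the sketch covers only a C or G that stands immediately before a vowel or the final-E marker, since the table as printed does not say how clusters such as CL or CK are treated.

```python
def resolve_c_or_g(letter, following):
    """Phonetic code for graphic C or G, given the next grapheme ("" if none)."""
    letter, following = letter.upper(), following.upper()
    if letter == "C":
        # C before A, O, or U maps to K; otherwise to S.
        return "K" if following in ("A", "O", "U") else "S"
    if letter == "G":
        # G before E maps to DZ1; otherwise to G.
        return "DZ1" if following == "E" else "G"
    raise ValueError("only C and G are ambiguous under this scheme")
```

Applied to the C of NICE (followed by the marker E) the function gives S, and applied to the G of SMUDGE it gives DZ1, in agreement with Figs. 7-3 and 7-4.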


Such mappings of graphic to phonetic values for the initial strings produce the
correct phonetic form in all but 58 of the one-syllable words of SOX. The 58 errors
include those words with "uncommon" consonant strings, the words where the C or G
rule fails, and those words where such specific mapping fails on the consonant strings.
The most notable case of the latter set is the graphic string TH, which maps into P1
or D1. The only simple algorithm suggested, which is to map initial TH into P1,
works for about 90 percent of the cases.

The mapping for the terminal consonant strings is similar to that for the initial
strings; however, it becomes necessary to treat separately the large number of strings
that are indicated as being difficult to pronounce by the rules of euphonic combination.
For this reason, the terminal consonant strings are first mapped into corresponding
phonetic consonant strings in a manner similar to that described for the initial strings.
The resulting phonetic strings are processed by rules of euphonic combination and
separated into pronounceable consonant strings and consonants that act as vowel markers
to be used in the mapping of the vowel strings. Such a processing provides accurate
mapping of all but 3.55 words in the set of one-syllable words studied, and it identifies
the vowel marking consonants to be discussed under vowel string mapping.

THE VOWEL STRING MAPPING

Table 7-1 shows the possible phonetic vowel strings for each of the 19 graphic
vowel strings, after uncommon strings have been removed. The only graphic strings
that provide a specific phonetic map are:

Graphic    Phonetic

AI         E14I9
EY         E14I9
OI         OI
OY         OI

For most other cases, it becomes necessary to use the vowel marking consonants
identified in the processing of the terminal consonant strings; one of the important
exceptions is the initial consonant W, which influences pronunciation of the following
vowel, as is evident in the pronunciations of the words AS and WAS.

Table 7-1

GRAPHIC TO PHONETIC MAPPINGS OF VOWEL STRINGS

Graphic
String     Phonetic Strings

A          A, A1, A2, A4, E14I9, E4E29, O164, O6, O64
AI         E14I9
AU         A2, O64
AY         AI, E14I9
E          E, E14I9, E24, E4E29, IU14, I14, U14
EA         E, E14I9, E24, E4E29, I14, I4E29
EE         I14, I4E29
EI         E14I, E14I9, E2I, E4E29, I14
EY         E14I9
I          E2I, E2IE29, E24, I, I14
IE         E2I, E2IE, E2IE2, I14, I4E29
O          A2, A2U, A3, A34, O, O1, O14U9, O16, O164, O4, O4E29, O6, U14
OA         O14U9, O4E29
OI         OI
OO         U, U14, U4E29
OU         A2U, A2UE29, A3, O14U9, O16, O4E29, O64, U, U1, U14, U4E29
OY         OI
U          A3, A34, IU14, IU4E29, I9U14, U, U14
Y          E2I, E2IE29, I

Since a detailed listing of all the necessary maps to resolve ambiguity in the other
cases would be of limited interest, we will here content ourselves with a few examples
to show how the vowel marker system operates in the simpler cases. The strings EE
and OA, for instance, illustrate the importance of a following R as a marker. In both
cases the potential ambiguity is resolved by the presence or absence of a following R.

                              Graphic    Phonetic

If EE is followed by R        EE         I4E29
Otherwise                     EE         I14
If OA is followed by R        OA         O4E29
Otherwise                     OA         O14U9

In such cases R induces the so-called visarga vowel, as has been noted in Reference 8.
Thus R can act as a "silent" consonant used only for the purposes of marking vowels,
just as does the final E, the DG form, the GH of the GHT form, and so forth.

To get a more complete picture of the vowel marking system, consider the graphic
vowel E.

                                                      Graphic    Phonetic

If E is followed by RE                                E          E4E29
Otherwise if E is followed by R                       E          E24
If E is followed by W∅ and preceded by L or R         E          U14
Otherwise if E is followed by W                       E          IU14
Otherwise if E is followed by a single consonant
  which is, in turn, followed by E                    E          I14
Otherwise                                             E          E

This rule provides the correct results for all but seven of the words in this set. Note
that the final E applies only when there is but a single consonant preceding it, and that
the W marker is modified both by a following blank consonant and a preceding L or R.

In essence, the markers must be used in conjunction with one another by way of a set
of precedence relations, and this may in part be responsible for the general feeling
that English orthography does not present a nice means of representing phonetic values.
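
The precedence relations for the graphic vowel E can be written as an ordered chain of tests. The sketch below is an assumption-laden illustration, not the report's program: `map_single_e` is a name introduced here, the arguments are the graphemic strings before and after the E (with any final-E marker kept in `following`), and the code I14 for the "single consonant plus E" case is a reading of a poorly legible table entry.

```python
def map_single_e(preceding, following):
    """Phonetic code for a single graphic E, applying the rules in order of precedence."""
    preceding, following = preceding.upper(), following.upper()
    if following.startswith("RE"):
        return "E4E29"
    if following.startswith("R"):
        return "E24"
    # W followed by the blank consonant (word-final W), after L or R.
    if following == "W" and preceding[-1:] in ("L", "R"):
        return "U14"
    if following.startswith("W"):
        return "IU14"
    # A single consonant followed by the final-E marker (I14 is an assumed reading).
    if len(following) == 2 and following[0] not in "AEIOUY" and following[1] == "E":
        return "I14"
    return "E"
```

Under these assumptions HERE, HER, BLEW, FEW, METE, and BED map to E4E29, E24, U14, IU14, I14, and E respectively. The order of the tests matters: checking for a following R before a following RE would, for example, send HERE to E24.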


Overall, the mapping of specific graphic strings into corresponding specific
phonetic strings with the necessary use of vowel markers provides the correct result
in all but 151 of the one-syllable words of SOX. Thus the total mapping of the
consonant strings and vowel strings provides the correct phonetic value in all but 363
of the 5,757 one-syllable words when compared to the phonetic values given by SOX.
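
The accuracy implied by these counts follows from simple arithmetic. The short calculation below uses only figures stated in the text (5,757 words, 363 mapping errors, and the 407 errors found when the program output was checked); the function name is introduced here for illustration.

```python
TOTAL_WORDS = 5757  # one-syllable words in the source dictionary


def accuracy(errors, total=TOTAL_WORDS):
    """Fraction of words transcribed correctly for a given error count."""
    return 1 - errors / total


mapping_accuracy = accuracy(363)  # errors of the combined string mappings
program_accuracy = accuracy(407)  # errors detected in the program output
```

Both values, about 93.7 and 92.9 percent, are consistent with the figure of 93 percent quoted in the paper.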

SUMMARY 

When one considers the mappings from the orthographic forms of the elementary
words into their corresponding phonetic forms, as in the Shorter Oxford Dictionary,
it becomes apparent that the consonant mappings seem straightforward, but that proper
processing of these strings by the rules of euphonic combination identifies the
pronounceable consonants and the consonants that function as vowel markers. By such
identification of the vowel markers, it is possible to obtain a highly accurate mapping
of the graphic vowel strings into their corresponding phonetic values with complexity
that does not exceed the mapping of any specific graphic string into a corresponding
specific phonetic string. When such a mapping is constructed, some 93 percent of the
elementary words are interpreted correctly. The residue consists primarily of obscure
forms, important structure words that have a unique spelling (relative to their
pronunciation) such as the word ARE, and a set of ambiguous forms that can only be
resolved by examining the surrounding context of the word in a given usage, such as
the words BOW and HOUSE.

With this analysis we see that (at least for the elementary words) the English
orthography is a highly developed phonetic system that provides information about the
precise pronunciation of the consonants and vowels and the duration of the vowels and
certain consonants. Other work indicates that the necessary stress information is also
available in the graphic forms. No other published phonetic system in use at present
can claim to accomplish this without the use of a very large population of phonetic
symbols in addition to the necessary incorporation of diacritic markers for indication
of duration and stress.

8. COMPUTER STUDY OF TRANSCRIBED ENGLISH PHONETICS:

A  PROGRESS  REPORT 


R.  P.  Mitchell 

This report is intended to be a brief summary of an important phase of the
computer-oriented research within the field of English phonetics which is now in
progress at the LMSC Information Sciences Laboratory. The Laboratory research
effort in English phonetics is not restricted to the work reported here; it also includes
studies of the techniques of generalized speech spectrum analyzers and the
instrumented palate. Nevertheless, the work reported here is basic and indispensable to
the total planned program in English phonetics at the Laboratory.

This report is not a detailed "work paper." Rather, it is a summary of background
and progress; its purpose is to indicate the nature of the research effort in sufficient
detail that its significance can be evaluated. Detailed research results have been
reported elsewhere, and are contained in the papers listed in the references.

Of prime importance to an adequate understanding of this research is the role of
machine processing of data. Unlike many applications of data-processing techniques
in linguistics, the role of the computer here is not primarily the role of "accountant."
To be sure, we are handling a fairly large group of data, and these data are growing
larger in both volume and complexity, so that accounting services of some
sophistication are always needed; but the part which the computer is asked to play is not limited
to this role.

The computer is used primarily to compute. Its inputs are not, of course,
conventional numeric data as they would be, for example, for an orbit trajectory
computation. In this application, the computer inputs are Hollerith characters, and remain
alphameric throughout the processing cycle. For the basic data-generation programs,
the inputs are elements of a subset of English words in conventional spelling form and
the outputs are phonetic transcriptions of the inputs. As the research project now
stands, what takes place within each processing cycle is the computation of the English
word's pronunciation according to a given recognized authority; this computation is
efficient, accurate, and requires no external input other than the conventional form of
the English word. The processing cycle is not a trivial one of matching inputs with
elements in a large memory-core; it is a nontrivial cycle involving a computation of
the phonetic transcriptions of the input word using all the relevant information contained
in the graphemic structure of the word.

In addition to the basic data-generation programs, as much of the analysis of the
phonetic data as is possible is carried out by various formatting and counting programs.
These programs fall largely within the category of "accounting services." They are
essential, though unspectacular, services which the computer is asked to perform.

At this writing, the subset of English words from which basic phonetic data has been
generated is the set of elementary words as defined by Dolby and Resnikoff (Reference 1),
which is a set of some 5700 words. The pronunciation of each of these words has been
determined for each of five authoritative sources, so that our actual set of data consists of
some 50,000 entries. This is a sizeable file to maintain and edit as it now stands, but
our immediate plans are to extend our results to the set of words which contain one
and only one phonetic vowel according to the transcriptions in the Shorter Oxford
Dictionary. This would increase the input-word set to about a, P00 entries, with a
corresponding minimum of PO. 000 phonetic entries. To process files of alphameric
data of this size, without extravagant use of computer time, requires extremely
efficient and reliable data-processing techniques.

In this respect, the research on programming techniques carried out elsewhere
in the Information Sciences Laboratory has been most helpful in this project. Two of
the basic data-generation programs were originally coded in CHARM, developed by
Lois Earl of this Laboratory for the purpose of processing linguistic data. CHARM
is easy to learn, easy to use, and efficient for files of medium size. Many of our
programs were coded in MPL1, a multipurpose language developed by Roger Stark,
also of this Laboratory. MPL1, which uses the XPOP compiler, generates extremely
efficient codings for processing files of all sizes. For our formatting and analysis
programs, MPL1 was found to be a most helpful tool.

The basic source of data input for this project is the word-list generated by Dolby
and Resnikoff; the set of elementary words, defined by the same authors in Reference 1,
was chosen as a basic set with which to investigate relationships between orthographic
and phonetic representations of English words. The choice was governed partly by the
fact that this set is easily defined in terms of orthographic structure, and partly also
by the fact that it is a sufficiently large set of lexed items with which to begin the
study. D. V. Bhimani, Information Sciences Laboratory consultant and the principal
investigator for phonetics research at the Laboratory, initiated this investigation and
has carried it through to its present form.

The set of elementary words corresponds approximately to the set of all lexed
items of the form (consonant string)(vowel string)(consonant string). In this definition,
the set of graphic vowels contains the elements a, e, i, o, u, and y; final e, denoted
by its own symbol, is an element of the set of nonvowels, or consonants; and either or
both of the consonant strings may be empty, but the vowel string is non-empty. For
convenience, in this paper we use parentheses to mark off orthographic elements.
The*. Wi . (tniuf) . and are all orthographic elements. Note that there is
a difference between the elements (xyz) and (x)(y)(z); in the former, we mean to
indicate consideration of the string "xyz" as an integral element, while in the latter
we intend to consider the contingent concatenation of the graphic elements (x), (y),
and (z). The set of graphic vowels is denoted by V and the set of graphic nonvowels
is designated by C.
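
The (consonant string)(vowel string)(consonant string) form can be sketched as a small
splitting routine. This is an illustration under stated assumptions rather than the
procedure of Dolby and Resnikoff: the function name split_cvc and the placeholder
character used for final e are hypothetical, the statistical and grammatical restraints
on elementary words are not modeled, and final e is treated as a nonvowel in every
word, exactly as the definition above is worded.

```python
import re

GRAPHIC_VOWELS = "aeiouy"   # the set V; every other letter belongs to C
FINAL_E = "#"               # hypothetical stand-in for the special final-e symbol

# (C)(V)(C): optional consonants, one non-empty vowel string, optional consonants.
_CVC = re.compile(
    rf"^([^{GRAPHIC_VOWELS}]*)([{GRAPHIC_VOWELS}]+)([^{GRAPHIC_VOWELS}]*)$"
)


def split_cvc(word: str):
    """Return (c1, v, c2) if the word has the form (C)(V)(C), else None."""
    w = word.lower()
    if not w.isalpha():
        return None
    # A final e counts as a nonvowel, so swap it for a consonant placeholder.
    body = w[:-1] + FINAL_E if w.endswith("e") else w
    match = _CVC.match(body)
    if match is None:
        return None
    c1, v, c2 = match.groups()
    return c1, v, c2.replace(FINAL_E, "e")


if __name__ == "__main__":
    for w in ["scrape", "bow", "house", "banana"]:
        print(w, split_cvc(w))
```

A word with more than one vowel string, such as "banana", falls outside the form and
yields None, while either consonant string may come back empty.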

In this notation, the set (X), where X is either C or V, is the set of all
elements (x), (x(w)), ((x)w), (xw), (x(w(z))), (x((w)z)), ..., where x, w, and z
are elements of either C or of V, but not both. The work of Dolby and Resnikoff on
the graphemic structure of English words is basically the study of written English in
terms of sequences of the form (C)(V)(C)(V)(C)...(C). In other words, a lexed item,
except for convenience of reference, is never considered as a distinct unit of the form
(CVCVC...C).

Approximately, then, the study of elementary words is the study of strings of the
form (C)(V)(C), with the study restricted first to those lexed items in the Shorter
Oxford Dictionary which can be represented as an element of the set (C)(V)(C), and
second to certain statistical restraints which eliminate obscure words and unusual
consonantal combinations. There was, however, a third restraint operating in this
investigation, viz., the general purpose of the investigators in attempting to discover
relationships between the orthographic structure of words and their possible
grammatical properties. To obtain as nearly grammatically homogeneous partitionings of
the set (C)(V)(C) as possible, the authors were led to criteria which eliminate a small
but important set of lexed items strictly elements of (C)(V)(C). On the whole,
however, the set of elementary words as defined by Dolby and Resnikoff is representative
of English (C)(V)(C) words.

The question of whether there exist computable relations between the graphic
representation of English words and their sets of pronunciations is a topic of considerable
intrinsic and economic interest. It is not necessary here to elaborate on the possible
important implications of an affirmative answer to this question. To begin to make
the question a sensible scientific problem and further a meaningful computational
problem, there must exist phonetic data in some reasonably tractable form. A major
part of linguistic science has always been concerned with various aspects of the problem
of representing phonetic data. We thus have the various phonetic alphabets, phoneme
studies, the notion of "distinctive features," and so on, as attempts to give a certain
structure to an apparently otherwise unstructured mass of data. As nearly as we can
determine, and without questioning the scientific merit of these efforts, the kinds of
structure which they induce upon the data are not computationally practicable, even
when they happen to be computable in principle. It seems reasonable to assume that
if the question is meaningful at all, it is meaningful at some lower level of structure
among the several which have appeared in the literature. Such a level is exemplified
by the pronunciations of words as recorded and transcribed for various dialects of
English by different authorities. Clearly, there is no "proper" operational definition
of a pronunciation of a word, and just as clearly there exist certain limits within which
communication by means of oral articulation of the word is signified. There is an
automatic interplay within the communication process of the various aspects of speech
production, perception, recognition, and representation, to name but a few of the
larger aspects of the process. There do exist transcriptions of speech which attempt
to represent the pattern of speech as perceived by the transcriber and communicate
the perceived pattern to others (within the limits of the phonetic alphabet used). These
transcriptions have inherent limitations, and are clearly not intended to be
reproductions of speech in the same sense that tape recordings and wave forms are physical
representations of the speech patterns they reproduce.

The value of transcribed phonetics to the question we are considering lies in the
fact that the transcriptions constitute "intermediate data" within this question. Not all
of the physically significant features of speech production are represented in the
transcription, but enough of the phonetically significant features of the speech pattern
are present in quality transcriptions to make the pattern meaningful as data. Our
first task, therefore, was to discover whether there exist relations between the graphic
structure of lexed items and the phonetic transcriptions of those items as transcribed
by recognized authorities. If the answer should be negative in any practical sense,
then the problem of relating phonetic and graphic structures must rely for its solution
upon analysis of some lower level of phonetic data. On the other hand, if the answer
should be affirmative, and if the relations turn out to be practicably computable as
well, then the problems of analysis and synthesis are simplified and a valuable
intermediate structure is made available.

An authoritative source of phonetic transcriptions of lexed items lay in the various
phonetic dictionaries and the conventional dictionaries. Bhimani, using work completed
before joining the Lockheed consulting staff, was able to construct an algorithm relating
the orthographic structure of elementary words to their corresponding phonetic
transcriptions in the Shorter Oxford Dictionary with 93 percent accuracy. This result was
reported in Reference 3. The term "93 percent accuracy" means that, on the average,
the algorithm will fail to yield the Shorter Oxford phonetic transcription for 7 words
out of 100. This result was sufficiently encouraging that he turned his attention to the
problem of constructing similar algorithms for other authoritative transcriptions.

Using the results for the Shorter Oxford algorithm, he obtained algorithms for the
phonetic transcriptions of the set of words according to four other authorities,
representing individually and collectively several sets of dialectal variants of spoken English.

The success of this work depended upon a proper understanding of the rules of
euphonic combination for English and the manner in which conventional spelling utilizes
graphic symbols to indicate vowel duration and stress. A well-known example of this
"marking system" is the final e marker. In the word "scrape," for example, the algorithm
first splits the word into its graphic CVC form, then looks for markers which (if
present) yield a modified pronunciation of the word. Thus "scrape" is split into
(scr)(a)(pe), the final e is recognized as a marker in this context with no others
present, and the resultant transcription is processed. It is of interest that, to obtain
these accurate results via the marking system, the general orthographic domain is not
the set (C)(V)(C) as it was for purely orthographic studies. The general orthographic domain
for obtaining a proper phonetic image is the union of the sets (C)(V(C)) and ((C)V)(C),
a much larger set. This certainly does not imply, however, that the set of phonetic
images is equivalent to the set of allophones. This is because, first, not every
individual combination of graphic vowel and graphic consonant needs to be interpreted
precisely and, second, not all combinations of phonetic vowel and phonetic consonant
need to be interpreted precisely when a mapping is effected from one phonetic source
to another.
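
The marker step in the "scrape" example can be sketched as a test applied to a word
that has already been split into its graphic CVC form. This is a hedged illustration
only: the function name final_e_marks_vowel is hypothetical, and the single-consonant
condition is taken from the remark elsewhere in this report that the final E applies
only when a single consonant precedes it. Other markers, and the transcription step
itself, are not modeled.

```python
GRAPHIC_VOWELS = set("aeiouy")


def final_e_marks_vowel(segments) -> bool:
    """True if the trailing consonant string is one consonant followed by final e.

    `segments` is a (c1, v, c2) triple such as ("scr", "a", "pe").
    """
    _c1, _v, c2 = segments
    return (
        len(c2) == 2
        and c2[1] == "e"
        and c2[0] not in GRAPHIC_VOWELS
    )


if __name__ == "__main__":
    print(final_e_marks_vowel(("scr", "a", "pe")))   # single consonant, then e
    print(final_e_marks_vowel(("d", "a", "nce")))    # two consonants before e
    print(final_e_marks_vowel(("b", "o", "w")))      # no final e
```

Note that the test reads the vowel together with the consonant string that follows
it, which is the (C)(V(C)) grouping of the domain rather than the plain (C)(V)(C) one.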

Upon completion of these algorithms and their implementation to obtain calculated
phonetic transcriptions, the results were checked entry by entry against each source,
errors were noted and corrected, and the entire corrected output placed on a single
file of magnetic tape. The next step was to reformat the data in such a manner that
all pronunciations of each word could be displayed at once. This was accomplished in
a single MPL1 program which required only 3 minutes of 7004 computer time. A
section of the output of this program is shown in Fig. 8-1.
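
The reformatting step, gathering every source's pronunciation of a word onto one
display row, can be sketched as follows. This is not the MPL1 program: the function
name collate, the source labels, and the transcription strings in the usage example
are all invented placeholders, and the tape file is stood in for by a list of
(word, source, transcription) records.

```python
from collections import defaultdict


def collate(entries, sources):
    """Group (word, source, transcription) records into one row per word.

    Each row is (word, (t_1, ..., t_n)) with transcriptions in `sources` order;
    a source with no entry for the word contributes an empty string.
    """
    by_word = defaultdict(dict)
    for word, source, transcription in entries:
        by_word[word][source] = transcription
    return [
        (word, tuple(by_word[word].get(s, "") for s in sources))
        for word in sorted(by_word)
    ]


if __name__ == "__main__":
    sources = ["SRC_A", "SRC_B"]              # placeholder source labels
    entries = [
        ("scrape", "SRC_A", "t1"),            # placeholder transcriptions
        ("bow", "SRC_B", "t2"),
        ("scrape", "SRC_B", "t3"),
        ("bow", "SRC_A", "t4"),
    ]
    for row in collate(entries, sources):
        print(row)
```

A single pass over the records followed by one sort keeps the work proportional to the
size of the file, which matters for files of the size described above.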

Fig. 8-1 Section of Output of "Format" Program Showing American Dialects

We could now evaluate the calculated transcriptions "across the board" by
comparing each of the five sets of pronunciations obtained for each word against the other
four. It had already been obvious that each of the five sources possessed its own
interpretation of the marking system for the language. How do these interpretations
differ and what are the effects of these differences? Are the differences so great that
it is impossible to construct an algorithm which calculates all of the given transcriptions
at once for each word? These and other related questions concerning the value and
consistency of the phonetic data are currently in the process of being answered. In
order to provide material for statistical analysis, and also to provide users of the
dictionary a key to the status of each phonetic entry with respect to the algorithm, a
code was inserted which describes the extent to which any given entry agrees with the
algorithm for that word (Fig. 8-2). This phase of the work has been completed and
preliminary results were reported in References 4 and 5. A brief summary of these
results follows.

Suppose that the observed differences among the authorities studied are caused by
"sound change" for the dialects recorded. This hypothesis could be tested with our
data by determining what transcription patterns in four of the sources correspond to a
given pattern in any single source. The results clearly indicate the following facts:

(a) Any algorithm which could be constructed to relate the phonetic transcriptions of
one authority to those of any of the other four would require more rules and a more
complex logic than the algorithm which produces the transcription from the orthographic
form of the word.

(b) The data show that rules of the type which simply substitute
phonetic vowels for graphic vowels and phonetic consonants for graphic consonants are
gross oversimplifications and lead to erroneous data. The data further show that rules
of the type which substitute phonetic vowels in one source for phonetic vowels in
another, and phonetic consonants in one source for phonetic consonants in another,
are imprecise rules to effect a mapping from one source to another.

(c) The data indicate a predictable dependency of vowel values upon surrounding
consonant values.

Note that these results were obtained for corrected data, and not just the data-generation
outputs. This result tends to confirm the existence of functional relationships between
vowel and consonant values which the marking system for English orthography
predicts.

Fig. 8-2 Section of Output of "Display" Program

Certainly, a legitimate question which can be, and ought to be, raised is how
accurate and consistent are the transcriptions themselves in each of the five sources
we have considered. Obviously, we cannot check the accuracy of each transcription
with respect to what was perceived when the transcriptions were made and the
dictionaries compiled, but we can expect to distill a gross picture of some of the perception
and compilation problems which the authorities individually encountered for this set of
words. Having available the corrected dictionary, we sorted each source separately
on the phonetic fields in terminal rhyme order and examined the result: There exist,
in each of the five sources studied, unexpectedly large sets of orthographically distinct
words which have the same phonetic transcription.

Two orthographically distinct words with the same transcription are said to be
"homonyms," and the set of all such words for a given transcribed pronunciation is
called the "homonym set" for that transcription. Our preliminary results show that
even removal of "extreme" variants in pronunciation, as well as recognized dialect
variants in each of the sources, does not significantly affect the high incidence
(averaging over 40 percent) of homonyms in this set of data. Quite apart from their number,
no great agreement was found among the five authorities regarding the sets of
homonyms. Evidently, many factors are at work to produce this result; problems of speech
production and perception, possible constraints in transcription techniques, and other
factors enter into this phenomenon. Nevertheless, though the extent of homonym sets
was surprising to us, the existence of homonyms is predictable in terms of the
orthographic marking system.
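
The homonym-set definition and the terminal rhyme ordering can be sketched for a
single source as follows. This is an illustration under assumptions: the function names
are hypothetical, the transcription strings in the usage example are invented
placeholders rather than entries from any of the five sources, "terminal rhyme order"
is taken to mean sorting on the transcription read from its last symbol backward, and
"incidence" is taken to mean the fraction of words that belong to some homonym set.

```python
from collections import defaultdict


def homonym_sets(transcriptions):
    """Map each transcription shared by two or more words to those words."""
    groups = defaultdict(list)
    for word, phonetic in transcriptions.items():
        groups[phonetic].append(word)
    return {p: sorted(ws) for p, ws in groups.items() if len(ws) > 1}


def homonym_incidence(transcriptions):
    """Fraction of words that belong to some homonym set (assumed definition)."""
    if not transcriptions:
        return 0.0
    in_sets = sum(len(ws) for ws in homonym_sets(transcriptions).values())
    return in_sets / len(transcriptions)


def rhyme_order(transcriptions):
    """Words sorted on their transcription read from the end backward."""
    return sorted(transcriptions, key=lambda w: transcriptions[w][::-1])


if __name__ == "__main__":
    # Placeholder transcriptions in an invented notation.
    source = {"bare": "BER", "bear": "BER", "pair": "PER",
              "pear": "PER", "bow": "BO"}
    print(homonym_sets(source))
    print(homonym_incidence(source))
    print(rhyme_order(source))
```

Sorting in rhyme order places words with identical transcriptions next to one another,
so the homonym sets show up as runs of adjacent entries in the sorted listing.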

At this point of our study, we possessed three distinct sets of relations: (a) the
set of relations between the orthographic forms and the phonetic transcriptions in each
of five sources; (b) the set of relations between the various phonetic forms for each of
the recorded dialects; and (c) the set of relations for the generation of transcriptions
in the other four authorities from the Shorter Oxford phonetics. Further, each of
these sets is a set of computable relations. It turns out that the segments of phonetic
forms in each of these sets of relations are identical. It will be recalled from earlier
discussion that these relations are not vowel-for-vowel and consonant-for-consonant
relations, but of the form typified by the orthographic domain: ((C)V) and (V(C)).
The fact that the segments are identical for the three sets of relations means that they
are independent of the properties of any particular transcription. These segments
provide the necessary and sufficient conditions, without having to resolve homonyms,
for the definition of the minimum segment of speech perception. In view of this result,
the problem of resolving homonyms takes on a new importance. Obviously other
information than that which has been used is required to resolve, or minimize, sets of
homonyms. Our current efforts are in the direction of determining whether
grammatical properties of the words are helpful. It is already apparent, however, that
simply listing and comparing possible parts of speech for homonyms is not going to be
very helpful. Larger context than the word in isolation may be required for effective
resolution of homonyms.

It is important to understand that the existence of calculable speech segments
which provide minimum segments of speech perception has been demonstrated entirely
without reference to theories of perception and linguistic analysis at a higher level
than the rules for euphonic combination, and analysis of phonetic data published in
dictionaries. There appears to be some confusion in the minds of linguists who have
seen these results and insist that we are studying graphemic-phonemic relationships
and that our results are not unique (having been obtained earlier). This is simply not
true. First, our level of analysis is much lower than the level of abstraction on which
phonemic analysis proceeds. We have no need for phonemic analysis, and it is
meaningless to discuss phonemes at the level of our data. Second, our results are
obtained from completely self-contained algorithms. So far as we have been able to
determine, no other operational procedure in this field obtains so much from so little.
We have shown that the results obtained by our methods do not depend upon any
particular transcription, and furthermore the transcriptions we have used are readily
available in any library; it is not impossible to reproduce our data and our results.
In any event, this is demonstrably not possible for procedures based upon phonemic
theories. Third, we have seen no comparable results anywhere else.